A chalkboard and a set of equations were the tools used by chip engineer Reiner Pope to deconstruct the training and inference logic behind models like GPT-5, Claude, and Gemini. By analyzing publicly available API pricing, he reverse-engineered architectural details that major AI labs often keep confidential.
In a recent and unusually formatted deep-dive conversation built around chalkboard derivations, renowned tech podcast host Dwarkesh Patel sat down with Reiner Pope, CEO of chip startup MatX. Pope, previously responsible for TPU architecture and compiler optimization at Google, is recognized as one of the few engineers with comprehensive expertise across the entire AI stack, from chip design to model architecture.
Using equations and diagrams on the chalkboard, Pope systematically broke down the underlying logic of cutting-edge large models, from training to inference. Patel noted that understanding these details makes it clear "why AI looks the way it does today—its architecture, pricing, and pace of progress—it all makes sense."
Key conclusions include: serving a single user request without batching can increase inference costs by up to 1000 times. The pre-training data volume for GPT-5 is roughly 100 times the theoretically optimal amount suggested by scaling laws. Models like DeepSeek V3 use a Mixture-of-Experts (MoE) architecture with 256 experts, activating only a small subset (e.g., 32) for each inference step. And a critical physical constraint on scaling such models is that the MoE expert layer must fit within a single rack of 72 GPUs.
The size of a GPU rack fundamentally dictates the maximum scale of a model. To understand why top-tier models are built as they are, one must start with the hardware. Modern large models run inference on GPU clusters. Nvidia's Blackwell NVL72 represents a common deployment form—a single rack housing 72 GPUs interconnected via high-speed NVLink, allowing communication between any two GPUs with just two hops through a central switch, providing extremely high bandwidth.
However, communication speed drops by a factor of eight once traffic must move between racks. This "8x difference" directly caps the practical deployment scale for MoE models. For instance, DeepSeek V3's 256 experts, with only 32 activated per inference step, naturally lend themselves to an "expert parallelism" deployment in which different experts are placed on different GPUs. This creates an "all-to-all" communication pattern, where any GPU might send tokens to any other, which aligns perfectly with the NVLink topology within a rack. But if the experts are spread across two racks, half of the token traffic would traverse the slower inter-rack network and immediately become a bottleneck. Pope stated, "The size of one rack limits how large your expert layer can be."
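To make the penalty concrete, here is a toy model of the all-to-all shuffle in an expert-parallel layer. Only the 8x intra-/inter-rack bandwidth ratio comes from the discussion; the per-GPU bandwidth, hidden dimension, and batch size below are placeholder assumptions.

```python
# Toy model of the all-to-all token shuffle in an expert-parallel MoE layer.
# Placeholder numbers throughout; only the 8x intra-/inter-rack ratio is from the text.

def per_gpu_shuffle_ms(bytes_per_gpu, nvlink_bw=900e9, inter_rack_ratio=8, racks=1):
    """Time for one GPU to send its routed tokens to the experts that need them."""
    if racks == 1:
        return bytes_per_gpu / nvlink_bw * 1e3             # all destinations are in-rack
    cross = bytes_per_gpu / 2                               # ~half the experts now live in the other rack
    local = bytes_per_gpu - cross
    inter_bw = nvlink_bw / inter_rack_ratio
    return max(local / nvlink_bw, cross / inter_bw) * 1e3   # the slower path sets the step time

payload = 2400 * 7168 * 2    # hypothetical: 2400 tokens x hidden dim 7168 x 2 bytes
print(f"one rack : {per_gpu_shuffle_ms(payload, racks=1):.2f} ms")
print(f"two racks: {per_gpu_shuffle_ms(payload, racks=2):.2f} ms")   # dominated by inter-rack traffic
```

Under these assumed numbers, the cross-rack case is several times slower even though only half the traffic leaves the rack, which is the sense in which the rack boundary caps the expert layer.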
This insight helps explain a longstanding market puzzle: why did Gemini appear to achieve large-scale pre-training success earlier than other labs? Pope infers that Google's TPU systems historically featured larger scale-up domains, which enabled efficient all-to-all communication across more chips and thus allowed Google to deploy sparser MoE models while maintaining inference efficiency.
Batching is the secret to reducing costs by up to 1000 times. The interview also addressed a common market observation: products like Claude and Codex offer "fast mode" options that cost 6 times more for only a 2.5x speed increase. Why can't a "slow mode" be offered at a lower price? Pope's answer was direct: the core variable is batch size. He used a "train schedule" analogy: GPUs dispatch a "train" (executing one batched inference step) roughly every 20 milliseconds, and the number of "passengers" on each train is the batch size.
The central conclusion is that the per-token inference cost is extremely high at small batch sizes, drops sharply as batch size increases, and eventually plateaus at a floor. The reason is the amortization of the fixed cost of loading model weights from memory (HBM) into the chip. This cost is the same whether 1 user or 2000 users are being served; the weights are read only once per step. A single user bears the full cost alone, while with 2000 users the per-user share becomes nearly negligible. Pope estimates that without batching, costs could be 1000 times higher.
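As a rough illustration of the amortization argument (the sizes below are placeholder assumptions, not figures from the conversation), per-token HBM traffic in decode can be modeled as a fixed weight-read cost shared by the batch, plus a per-user KV-cache term that cannot be shared:

```python
# Toy cost model for decode: bytes of HBM traffic attributable to one user's token.
# The weight read is shared by the whole batch; the KV-cache read is not.

def hbm_bytes_per_token(batch_size,
                        weight_bytes=100e9,        # assumed ~100B activated params, 1 byte each
                        kv_bytes_per_user=0.4e9):  # assumed ~400 MB of KV cache per sequence
    return weight_bytes / batch_size + kv_bytes_per_user

for b in (1, 10, 100, 2000):
    print(f"batch {b:>4}: ~{hbm_bytes_per_token(b)/1e9:5.1f} GB per token")
# The curve falls steeply and then flattens: the weight term vanishes into the
# batch, while the per-user KV term sets the floor.
```

Under these placeholder numbers the gap between a batch of 1 and a full batch is a few hundred times; with shorter contexts (a smaller KV term) it approaches the roughly 1000x figure Pope cites.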
What is the optimal batch size? Pope provided a concise formula: approximately 300 multiplied by the model's sparsity factor (the reciprocal of the fraction of parameters activated per token). For a model like DeepSeek, which activates about 1/8 of its experts, the optimal batch size is therefore around 2400 concurrent sequences. Notably, this number is independent of the model's total parameter count; it depends only on hardware characteristics and sparsity, a counter-intuitive result.
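One hedged way to reconstruct where the factor of roughly 300 comes from (the derivation is not spelled out above, and the hardware numbers below are assumptions): 300 is roughly a modern accelerator's compute-to-memory-bandwidth ratio, i.e. the batch size at which decode stops being bound by re-reading the weights, and sparsity dilutes the compute done per byte of weights read, so the batch must grow by the same factor.

```python
# Hedged reconstruction of the "~300 x sparsity" rule (illustrative hardware numbers).
peak_flops  = 1.0e15   # ~1 PFLOP/s of matmul throughput
hbm_bw      = 3.3e12   # ~3.3 TB/s of HBM bandwidth
bytes_per_w = 2        # assume 16-bit weights

# Each decode step reads every weight once and does ~2 FLOPs per weight per token
# in the batch. Compute-bound requires:
#   2 * batch / bytes_per_w >= peak_flops / hbm_bw
dense_batch = (peak_flops / hbm_bw) * bytes_per_w / 2
print(round(dense_batch))                     # ~300 for these assumed specs

# With only 1/8 of parameters active per token, each byte of weights read does
# 8x less compute, so the batch must be ~8x larger to stay compute-bound.
sparsity_factor = 8
print(round(dense_batch * sparsity_factor))   # ~2400, matching the figure quoted above
```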
So, can a "slow mode" be significantly cheaper? Mathematically, not really. The KV cache, which stores the conversation history for each user, cannot be shared or amortized across users. Therefore, making users wait longer does not substantially reduce costs. Pope said, "(Slow mode) doesn't save much because the KV cache is per-user, and the computation is per-user."
API pricing can be used to reverse-engineer model architecture. Pope demonstrated an impressive deductive process: inferring internal architectural parameters from public API pricing.
Clue 1: Gemini's price increases by 50% beyond 200,000 tokens. Why 50%? Why at 200,000 tokens? Pope explained this corresponds to the point where the memory bandwidth cost of the KV cache surpasses the computational cost of the weight matrices—the transition point where the model shifts from being "compute-bound" to "memory-bandwidth-bound." Using this number, and assuming around 100 billion activated parameters, he calculated that the KV cache occupies about 2 KB per token. This aligns closely with parameters described in public papers from organizations like Character.AI (e.g., 8 KV heads, dimension 128). "They are leaking a fair amount of information through their API pricing," Pope noted, adding that labs are incentivized to price close to cost to avoid being undercut by competitors.
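The exact arithmetic behind the 2 KB figure is not spelled out above, but one hedged reconstruction is to find the KV-cache size at which, for a 200,000-token context, reading one sequence's cache takes about as long as the weight matmuls for one token. The compute-to-bandwidth ratio and the 8-bit KV values below are assumptions.

```python
# Hedged reconstruction of the ~2 KB/token KV-cache estimate.
active_params  = 100e9     # ~100B activated parameters (from the text)
flops_per_byte = 500       # assumed accelerator compute-to-bandwidth ratio
crossover_ctx  = 200_000   # context length where the price surcharge kicks in

# Weight compute per token ~ 2 * active_params FLOPs;
# KV read per decode step  ~ crossover_ctx * kv_bytes_per_token bytes.
# Setting the two times equal on a chip with `flops_per_byte` of intensity:
kv_bytes_per_token = 2 * active_params / (flops_per_byte * crossover_ctx)
print(kv_bytes_per_token)   # ~2000 bytes, i.e. about 2 KB per token

# Sanity check against the architecture quoted in the text, assuming 8-bit KV values:
# 8 KV heads x head dim 128 x (K and V) x 1 byte.
print(8 * 128 * 2 * 1)      # 2048 bytes
```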
Clue 2: Output tokens are 3-5 times more expensive than input tokens. The reason lies in the difference between prefill and decode phases. Prefill processes a large number of input tokens in parallel, operating efficiently near the compute bottleneck. Decode generates one token at a time, requiring reading the entire model weights and KV cache for each step, making it severely constrained by memory bandwidth. This price difference quantifies the extent of the memory bandwidth bottleneck in current top models.
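A very rough, directional sketch of the same point (all numbers below are placeholder assumptions, including the prompt length, context length, and batch size, so it shows the sign of the gap rather than the exact 3-5x):

```python
# Why output tokens move more HBM bytes per token than input tokens (illustrative only).
weight_bytes     = 100e9   # assumed ~100B activated params at 1 byte each
kv_bytes_per_tok = 2e3     # ~2 KB/token KV cache, from the earlier estimate

# Prefill: one weight read is shared by every token in the prompt.
prompt_len = 8192
prefill_hbm_per_token = weight_bytes / prompt_len

# Decode: one weight read is shared only across the batch, and each sequence
# additionally re-reads its own KV cache on every step.
batch, context = 2400, 20_000
decode_hbm_per_token = weight_bytes / batch + context * kv_bytes_per_tok

print(f"prefill: ~{prefill_hbm_per_token/1e6:.0f} MB of HBM traffic per token")
print(f"decode : ~{decode_hbm_per_token/1e6:.0f} MB of HBM traffic per token")
```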
Clue 3: Why cached tokens are much cheaper. APIs often offer significant discounts for "cache hits." Pope explained this reflects the cost difference between storing the KV cache in different memory hierarchies: recalculating it from scratch versus reading it directly from HBM, DDR, or flash storage. Based on the price difference between Gemini's "5-minute cache" and "1-hour cache" tiers, he inferred these likely correspond to flash storage and hard disk drives (HDDs) respectively—the latter surprising even Pope: "I didn't expect hard drives to be used here."
How much is GPT-5 overtrained? The answer is 100x. Perhaps the most striking calculation from the discussion concerned optimal training data volume. Pope started from an economic intuition: overall efficiency is optimal when pre-training cost, reinforcement learning (RL) cost, and inference cost are roughly balanced. Writing these costs out as formulas, he found that the activated-parameter count cancels, meaning the optimal training volume is independent of model size itself and depends only on inference traffic.
Plugging in estimated real-world numbers (inference traffic of about 50 million tokens per second over a model lifecycle of roughly 2 months, totaling around 200 trillion inference tokens) and comparing this to the Chinchilla-optimal training data volume (around 2 trillion tokens for ~100B activated parameters), he arrived at a ratio of approximately 100:1. This suggests current top models are trained on about 100 times more data than pure training efficiency would dictate. Patel commented, "We know this is roughly correct because there are rumors that GPT-5 was trained on about 150 trillion tokens, close to our calculated 200 trillion." Pope added that the core logic is that the compute spent serving users should roughly equal the compute spent training the model; otherwise, money is being wasted on one end. Patel summarized: "If GPT-5 were to be trained optimally, then the total number of tokens all users generate using it should equal the number of tokens consumed in pre-training, and that pre-training data would be roughly the sum total of human knowledge." Pope concurred, "Approximately, yes."
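The back-of-the-envelope arithmetic can be written out directly with the numbers quoted above (all of them estimates from the conversation; the ~20 tokens-per-parameter figure is the standard Chinchilla rule of thumb, consistent with the ~2 trillion tokens for ~100B parameters cited):

```python
# Overtraining ratio, using the estimates quoted in the text.
inference_tokens_per_s = 50e6                  # ~50M tokens/s of serving traffic
model_lifetime_s       = 2 * 30 * 24 * 3600    # ~2 months
lifetime_inference_tokens = inference_tokens_per_s * model_lifetime_s
print(f"lifetime inference tokens: ~{lifetime_inference_tokens:.1e}")   # a couple hundred trillion

# Chinchilla-style optimum: ~20 training tokens per activated parameter.
activated_params  = 100e9
chinchilla_tokens = 20 * activated_params
print(f"Chinchilla-optimal tokens: ~{chinchilla_tokens:.1e}")           # ~2e12

print(f"overtraining ratio: ~{lifetime_inference_tokens / chinchilla_tokens:.0f}x")  # on the order of 100x
```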
Pipeline parallelism sounds good but is often impractical for inference. Regarding pipeline parallelism, which distributes different layers of a model across multiple racks for sequential execution, Pope concluded that while it saves memory capacity, it does not solve the KV cache problem, which limits its value in inference scenarios. Intuitively, pipeline parallelism requires keeping multiple batches "in flight," so the global batch size grows with the number of pipeline stages. Although the weight storage per rack decreases, the total KV cache across all racks does not shrink, because more concurrent sequences are needed to keep the pipeline full, as the sketch below illustrates. "You cannot amortize the KV cache across pipeline stages, just as you cannot amortize it across batches," Pope summarized. This provides an engineering rationale for a past comment by Ilya Sutskever, referenced by Patel, that "by now everyone knows pipeline parallelism is unwise."
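A toy accounting of the argument (the weight, KV, and batch sizes below are illustrative assumptions): more stages shrink the weights each stage must hold, but keeping the pipeline full requires proportionally more in-flight sequences, so KV-cache memory per sequence never improves.

```python
# Toy memory accounting for pipeline parallelism (all sizes are illustrative).
def pipeline_memory(stages, weight_bytes=600e9, kv_bytes_per_seq=0.4e9, step_batch=2400):
    weights_per_stage = weight_bytes / stages   # each stage holds 1/stages of the layers
    in_flight = step_batch * stages             # ~one full microbatch in flight per stage
    total_kv = in_flight * kv_bytes_per_seq     # every in-flight sequence needs its cache somewhere
    return weights_per_stage, total_kv, total_kv / in_flight

for p in (1, 2, 4, 8):
    w, kv, per_seq = pipeline_memory(p)
    print(f"{p} stage(s): {w/1e9:>4.0f} GB weights/stage, "
          f"{kv/1e12:.1f} TB KV total, {per_seq/1e9:.1f} GB KV per sequence")
# Weights amortize across stages; the KV cache does not.
```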
Convergent evolution between neural networks and cryptography. Towards the end, Pope discussed a concept from his blog: a "convergent evolution" exists between neural network architectures and cryptographic protocols. Both require thorough mixing of input information throughout the system—cryptography aims to make output resemble random noise, while neural networks aim to extract hidden high-level structures. Their goals are opposite: cryptography strives to destroy structure, whereas neural networks strive to discover it.
Pope cited a specific case of technical migration: the Feistel network, a cryptographic construction that builds an invertible transformation out of non-invertible round functions, was introduced into neural networks in 2017, leading to "RevNets" (Reversible Networks). RevNets allow activations to be recomputed during the backward pass of training instead of being stored, trading extra computation for lower memory usage. This stands in contrast to the KV cache logic, which trades more memory for less computation. Pope noted, "Trading memory for computation is usually economical under current hardware conditions."
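The reversible coupling at the heart of this idea fits in a few lines. Below is a minimal sketch of a Feistel-style reversible block of the kind RevNets use; the functions F and G here are arbitrary stand-in sublayers (random matrices plus a nonlinearity), chosen only for illustration, and they need not be invertible themselves.

```python
# Minimal Feistel-style reversible block: the forward pass can be inverted exactly,
# so activations need not be stored for the backward pass.
import numpy as np

rng = np.random.default_rng(0)
W_f, W_g = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
F = lambda x: np.tanh(x @ W_f)   # placeholder residual functions;
G = lambda x: np.tanh(x @ W_g)   # neither needs to be invertible

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    x2 = y2 - G(y1)   # recover x2 using only the outputs...
    x1 = y1 - F(x2)   # ...then recover x1, so nothing had to be cached
    return x1, x2

x1, x2 = rng.standard_normal((1, 64)), rng.standard_normal((1, 64))
assert np.allclose((x1, x2), inverse(*forward(x1, x2)))
```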