Groq founder and CEO Jonathan Ross has likened NVIDIA GPUs to "18-wheel semi-trucks" and positioned his company's LPU (Language Processing Unit) as a "last-mile delivery van," arguing that combining the two achieves the optimal balance of cost and speed for large language model inference tasks.
In a recent interview, Ross detailed this architectural division of labor. The prefill stage, which involves reading input text, is highly parallelizable and insensitive to per-token latency, making it ideally suited for GPU processing. The decoding stage, however, can be configured flexibly based on user sensitivity to speed and cost, ranging from pure GPU, to a GPU-LPU hybrid, to pure LPU. He stated that the LPU, with its all-on-chip SRAM architecture and static scheduling mechanism, holds a significant advantage in low-latency, small-batch decoding scenarios, making it particularly well-suited for today's prevalent mixture-of-experts (MoE) models.
As agentic AI applications proliferate, workloads increasingly split tasks across multiple models that call one another, which Ross said pushes compute demand to grow exponentially rather than linearly. Citing Jevons' paradox, he argued that a falling unit cost of compute does not shrink the market but keeps stimulating total demand, so the markets for GPUs and LPUs expand together rather than competing in a zero-sum contest.
This also provides a lens for understanding the strategic rationale behind Groq's $20 billion cooperation agreement with NVIDIA. For inference workloads, the products from the two companies play different roles, and their combined deployment is superior to using either one alone.
LPU and GPU: Complementary Positions on the Pareto Curve
Jonathan Ross pointed out that GPUs and LPUs have markedly different per-token cost curves, which suggests they cover different performance ranges rather than competing head-on.
"If you only want the lowest per-token cost, you use a GPU with a very large batch size, but it will be slower," he said. "The advantage of the LPU is its ability to scale across multiple chips, relying entirely on high-speed SRAM instead of external memory, significantly increasing token generation speed without a corresponding significant cost increase."
He stated that on the high-speed end of the Pareto curve, the LPU is more economical than the GPU. Combining the two, he said, yields the lowest per-token cost and the most compute capacity at any target speed.
The LPU is especially well suited to mixture-of-experts models. Ross explained that GPUs, which read data from DRAM, need batch sizes in the hundreds to be economical, whereas LPUs can operate effectively with batch sizes of around 10. This translates to shorter wait times and higher execution efficiency. "The LPU is almost tailor-made for expert models."
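The batch-size argument above can be made concrete with a deliberately simplified model. The sketch below assumes a decode step takes as long as the slower of two things: streaming the weights once, or doing the per-sequence math for the whole batch. All names (`Chip`, `break_even_batch`, `cost_per_token`) and every number are hypothetical and chosen only so that one chip breaks even at a batch in the hundreds and the other at around 10. They are not measurements of any GPU or LPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    weight_read_us: int      # time to stream the weights once per decode step
    compute_us_per_seq: int  # compute time added by each sequence in the batch
    cost_per_s: float        # hypothetical cost of running the system for one second


def step_time_us(chip: Chip, batch: int) -> int:
    # A step is bounded by whichever is slower: reading weights or computing.
    return max(chip.weight_read_us, chip.compute_us_per_seq * batch)


def break_even_batch(chip: Chip) -> int:
    # Smallest batch at which compute time catches up with the weight read
    # (ceiling division on integers).
    return -(-chip.weight_read_us // chip.compute_us_per_seq)


def cost_per_token(chip: Chip, batch: int) -> float:
    # Each step emits one token per sequence, so cost is shared across the batch.
    return chip.cost_per_s * step_time_us(chip, batch) / 1_000_000 / batch


def tokens_per_s_per_user(chip: Chip, batch: int) -> float:
    # Every user in the batch receives one token per step.
    return 1_000_000 / step_time_us(chip, batch)


# Illustrative values only: a slow external-memory read vs. a fast on-chip read,
# with the on-chip system assumed to cost more per second.
DRAM_BOUND = Chip("dram_bound", weight_read_us=20_000, compute_us_per_seq=100, cost_per_s=1.0)
SRAM_BOUND = Chip("sram_bound", weight_read_us=1_000, compute_us_per_seq=100, cost_per_s=3.0)

if __name__ == "__main__":
    for chip in (DRAM_BOUND, SRAM_BOUND):
        for batch in (10, 200):
            print(chip.name, batch,
                  cost_per_token(chip, batch),
                  tokens_per_s_per_user(chip, batch))
```

Under these assumed numbers the on-chip system is cheaper per token at a batch of 10 and the external-memory system is cheaper at a batch of 200, which is the shape of the trade-off Ross describes. Real hardware adds many effects that this toy ignores.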
Static Scheduling and MoE: The Inference Benefits of a Deterministic Architecture
Another core differentiator for Groq is static scheduling, in which the order of operations is fixed at compile time rather than decided dynamically at runtime.
Ross used a calendar analogy: short meetings require precise scheduling, while long meetings can be more flexible. "In an inference scenario, you're doing ultra-low latency, small-batch computation, so you need to schedule all operations in advance, letting each piece of computation finish quickly and release the hardware promptly for the next step. This is less critical during training but absolutely key for inference."
He also clarified that static scheduling does not mean an inability to adapt to dynamic routing. In an MoE architecture, the LPU's time slots are fixed, but "who to meet with"—that is, which expert's weights to activate—can change at runtime, achieving flexible routing through scatter-and-gather capabilities.
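The distinction between a fixed schedule and runtime routing can be illustrated with a toy example. In the sketch below, the number of slots and the operation performed in each slot never change, while the expert used in each slot is chosen from the data at runtime. The names `EXPERT_SCALE`, `SLOTS_PER_STEP`, `router`, and `run_step` are hypothetical, each "expert" is reduced to a single scale factor, and nothing here reflects how Groq's compiler or hardware actually works.

```python
from typing import List

EXPERT_SCALE = [1.0, 10.0, 100.0]  # stand-in for three experts' weights
SLOTS_PER_STEP = 4                 # fixed ahead of time: one slot per token


def router(x: float) -> int:
    """Toy router: derive an expert index from the token value at runtime."""
    return int(x) % len(EXPERT_SCALE)


def run_step(tokens: List[float]) -> List[float]:
    if len(tokens) != SLOTS_PER_STEP:
        raise ValueError("the schedule has a fixed number of slots")

    # Runtime decision: which expert each token goes to.
    expert_ids = [router(x) for x in tokens]

    # Gather: fetch each slot's weights by index. The lookup itself is always
    # scheduled; only the index varies.
    gathered = [EXPERT_SCALE[e] for e in expert_ids]

    # Fixed loop bound and fixed operation per slot: the "calendar" is static.
    out = [0.0] * SLOTS_PER_STEP
    for slot in range(SLOTS_PER_STEP):
        # Write each result back to its token's position.
        out[slot] = tokens[slot] * gathered[slot]
    return out


if __name__ == "__main__":
    print(run_step([0.0, 1.0, 2.0, 3.0]))
```

The point of the sketch is that data-dependent choices can be expressed as indices fed into operations whose timing is already planned, so the schedule itself never has to branch.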
Collaboration with NVIDIA: Prefill for GPU, Decoding Depends on Scenario
Following the $20 billion strategic cooperation agreement with NVIDIA, Ross described the specific division of labor between the two in the inference pipeline.
"The prefill stage—the stage of reading the input text—we recommend running entirely on the GPU, because this stage is highly parallelizable, which GPUs are very good at," he said. The decoding stage is configured in tiers based on user needs: cost-sensitive users might use GPU-only decoding; paying professional users could adopt a GPU-LPU combination; and extreme performance scenarios might consider pure LPU decoding.
He anticipates that the market will see more hybrid deployments that combine LPUs and GPUs, rather than Groq chips sold only on their own. "Combining the two is like using 18-wheelers and delivery vans together; you can build a better network."
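The tiering Ross describes amounts to a small placement table: prefill always goes to the GPU, and the decode backend depends on the user's tier. A minimal sketch follows, assuming three tiers that map one-to-one onto the three options he lists. The `Tier` enum, the backend labels, and `plan_request` are hypothetical names for illustration, not part of any Groq or NVIDIA API.

```python
from enum import Enum
from typing import Dict


class Tier(Enum):
    COST_SENSITIVE = "cost_sensitive"
    PROFESSIONAL = "professional"
    EXTREME = "extreme"


# Decode placement per tier, following the three options described above.
DECODE_BACKEND: Dict[Tier, str] = {
    Tier.COST_SENSITIVE: "gpu",
    Tier.PROFESSIONAL: "gpu+lpu",
    Tier.EXTREME: "lpu",
}


def plan_request(tier: Tier) -> Dict[str, str]:
    # Prefill is highly parallel, so it is always placed on the GPU.
    return {"prefill": "gpu", "decode": DECODE_BACKEND[tier]}


if __name__ == "__main__":
    for tier in Tier:
        print(tier.value, plan_request(tier))
```

A real scheduler would also weigh load, model size, and latency targets. The table only captures the policy as stated in the interview.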
Jevons' Paradox: The Cheaper Compute Gets, the Greater the Demand
Regarding the long-term trajectory of the AI compute market, Ross invoked the 19th-century economic concept of "Jevons' paradox": a decrease in the unit cost of compute power will not reduce total demand but will instead spur even greater demand.
"Jevons' paradox comes from a treatise on coal: whenever the efficiency of steam engines improved, total coal consumption increased," he said. "When the cost of an activity decreases, previously unprofitable activities become viable, and people are willing to do more experiments. As AI becomes cheaper, the demand for AI will only increase."
He further noted that agent architectures will amplify this effect. When AI breaks tasks into parallel subtasks, runs multiple agents simultaneously, and nests calls to other AI several layers deep, compute usage expands exponentially. "AI using AI using AI leads to an exponential explosion in usage."
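The exponential claim can be checked with simple arithmetic. The sketch below assumes an idealized call tree in which every model call spawns the same number of sub-calls at each level. The function name `total_calls` and the uniform branching factor are assumptions made for illustration, and real agent systems will be far less regular.

```python
def total_calls(branching: int, depth: int) -> int:
    """Count model calls in a tree where each call spawns `branching`
    sub-calls, nested `depth` levels below the root call."""
    return sum(branching ** level for level in range(depth + 1))


if __name__ == "__main__":
    # A chain with no fan-out grows linearly with depth.
    print(total_calls(branching=1, depth=3))
    # Fan-out of 3 grows geometrically with depth.
    print(total_calls(branching=3, depth=3))
    print(total_calls(branching=3, depth=6))
```

With a branching factor of 1 the count is just depth plus one, which is the linear case. With any larger branching factor each extra layer of nesting multiplies the work, which is the "AI using AI using AI" effect in the quote.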
Ross's conclusion is that "a success disaster" is inevitable—the more compute power Groq and NVIDIA provide to the market, the more the market will want.