AI is transitioning from "generating information" to "executing tasks," with low-latency, high-throughput inference scenarios, exemplified by coding agents, opening the next critical phase of commercialization for AI infrastructure. On the supply side, power, chips, and data center construction are all operating with minimal redundancy, suggesting a prolonged period of tight supply-demand balance as the new normal for the industry.
Following his keynote at GTC 2026, NVIDIA CEO Jensen Huang participated in an interview with Ben Thompson, founder of Stratechery, offering a comprehensive perspective on core issues including the AI inference economy, CPU strategy, the rationale behind acquiring Groq, and ongoing supply chain constraints.
Huang pointed out that AI crossed a crucial threshold over the past year: advances in reasoning capability have enabled models to generate tangible economic value for the first time, with the rapid emergence of coding agents being the clearest indicator of this shift. NVIDIA has now formally added ultra-fast, low-latency inference to its product portfolio.
Regarding supply, Huang stated frankly that "almost every link in the chain is tight," noting that neither power nor chip supply can be easily doubled. While NVIDIA has planned its supply chain for "this year and next," he expressed a stronger desire for faster deployment of "land, power, and data center buildings," as this will directly influence the pace of computing expansion and the realization of capital expenditure plans.
**The Inference Economy: Low Latency as the Next Paid Service Engine** Huang identified the maturation of reasoning as the core breakthrough in AI over the past year. He explained that early generative AI struggled to commercialize because of hallucinations; with reasoning, models can reflect, retrieve, and search before answering, elevating them from providers of information to systems that genuinely complete tasks.
"Search is a service nobody pays for, because the barrier to obtaining information isn't high enough to warrant payment," Huang said. "We have now crossed that threshold—AI can not only converse with people but also perform tasks for them."
He cited programming as the clearest example. Code is not just another language modality: generating it requires the model to reason over entire code blocks and validate execution results. As this capability has matured, engineers have been able to shift their focus from writing code line by line to architecture and specification design.
He revealed that 100% of NVIDIA's internal software engineers now use coding agents, adding that "many haven't written a single line of code themselves for some time, yet their productivity is extremely high."
Based on this assessment, NVIDIA decided to add low-latency inference to its product line. Huang explained that existing GPU systems face an inherent tension between maximizing aggregate throughput and maximizing the token generation speed each individual user experiences, and that users of high-value coding agents are willing to pay a premium for a tenfold increase in that speed.
"If Anthropic launched a Claude Code service layer that made programming ten times faster, I would pay for it, no question. I am building this product for myself," he stated.
**Acquiring Groq: A Strategic Move to Deconstruct the Inference Pipeline** Huang views the decision to acquire Groq not as a sudden move but as a natural extension of NVIDIA's multi-year strategy in inference infrastructure.
He indicated that when NVIDIA released its Dynamo inference scheduling framework a year ago, it was already contemplating how to deconstruct the inference process more granularly across heterogeneous infrastructure. Collaboration with Groq began approximately six months prior to the acquisition announcement. The core of the deal was to acquire Groq's team and technology licenses, not its cloud service business.
Technically, NVIDIA plans to extend the deconstruction of the inference pipeline into the decoding phase itself. The Vera Rubin GPU will handle high-FLOP attention computations, while Groq's LPU architecture will manage the parts requiring extremely high token rates and ultra-low latency. Products based on this approach are planned for release within the year.
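NVIDIA has not published how this split will actually be exposed, so the sketch below is purely conceptual: every class, name, and interface is invented for illustration, and it only shows the shape of the idea of one decode step being divided between a FLOP-heavy attention stage and a latency-critical generation stage on different accelerators.

```python
# Conceptual sketch only; not a published NVIDIA or Groq interface.
# It illustrates splitting a single decode step across heterogeneous accelerators.

from dataclasses import dataclass

@dataclass
class DecodeState:
    kv_cache_handle: int   # opaque reference to this request's KV cache
    last_token: int        # most recently generated token id

class AttentionAccelerator:
    """Stand-in for the GPU side: FLOP- and bandwidth-heavy attention over the KV cache."""
    def attend(self, state: DecodeState) -> list[float]:
        # A real system would launch attention kernels here and return the
        # hidden state for the current position; this is a placeholder.
        return [0.0]

class GenerationAccelerator:
    """Stand-in for the LPU side: the latency-critical path that emits tokens at a high rate."""
    def next_token(self, hidden: list[float]) -> int:
        # A real system would run the remaining layers and sample a token; placeholder.
        return 0

def decode_step(state: DecodeState,
                gpu: AttentionAccelerator,
                lpu: GenerationAccelerator) -> DecodeState:
    """Generate one token, with each portion handled by the hardware suited to it."""
    hidden = gpu.attend(state)        # compute-bound portion stays on the GPU
    token = lpu.next_token(hidden)    # rate/latency-bound portion goes to the LPU
    return DecodeState(state.kv_cache_handle, token)
```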
He elaborated, "But if your business is like Anthropic or OpenAI, where Codex is generating real economic value and you want to generate more tokens, then integrating this accelerator can significantly boost revenue."
He concurrently acknowledged that this solution isn't suitable for all customers. For platforms primarily serving free users with low conversion rates to paid services, integrating Groq's technology would add cost and complexity without being cost-effective.
Huang compared the Groq acquisition to the earlier acquisition of Mellanox, noting that both represent NVIDIA's consistent logic of incorporating external, specialized architectures into its own computing stack to achieve system-level synergistic optimization. "NVIDIA is an accelerated computing company, not a GPU company. We are not fixated on where the computation happens; we just want to accelerate applications."
**CPU Strategy: Redefining Server Architecture for the AI Agent Era** Addressing the long-standing perception of NVIDIA as a GPU company, Huang laid out the rationale for its entry into the CPU market and the design philosophy behind its in-house Vera CPU.
He pointed out that over the past decade, CPU design has been optimized for hyperscale cloud computing—aiming to maximize the number of rentable cores, with single-thread performance not being the top priority. However, in AI agent scenarios, while the GPU waits for tool call returns, the single-thread performance of the CPU directly determines the overall system efficiency. "You never want GPU time sitting idle," he said.
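A back-of-the-envelope calculation makes the point (numbers assumed for illustration, not NVIDIA measurements): if an agent loop alternates GPU token generation with serial CPU work such as launching a tool and parsing its result, the fraction of time the GPU is actually busy is capped by that serial gap, so faster single-thread CPU performance translates directly into higher GPU utilization.

```python
# Illustrative arithmetic only; all timings are assumptions, not measured values.

def gpu_utilization(gpu_ms_per_step: float, cpu_ms_per_step: float) -> float:
    """Fraction of each agent step during which the GPU is doing useful work."""
    return gpu_ms_per_step / (gpu_ms_per_step + cpu_ms_per_step)

GPU_MS = 40.0  # GPU time per agent step, e.g. generating the next tool call (assumed)

# Halving the serial CPU time per step raises GPU utilization from 50% to 67% to 80%.
for cpu_ms in (40.0, 20.0, 10.0):
    busy = gpu_utilization(GPU_MS, cpu_ms)
    print(f"CPU work {cpu_ms:4.0f} ms per step -> GPU busy {busy:.0%} of the time")
```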
The core differentiation of the Vera CPU lies in its memory bandwidth and I/O bandwidth: its bandwidth per CPU core is three times that of any current CPU, specifically designed to avoid bottlenecking the GPU with I/O limitations. He also mentioned collaboration with Intel on NVLink to meet the enterprise computing market's need for continuity within the x86 ecosystem.
Huang categorized tool usage for AI agents into two types: structured tools, including CLI, API, and database queries; and unstructured tools, involving PC applications where the model operates a web interface through multimodal perception. NVIDIA is developing capabilities for both paths.
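To make the two categories concrete, the definitions below are generic, hypothetical agent-tool schemas (not an NVIDIA interface): the structured tool exposes typed parameters the model can fill in and the runtime can execute deterministically, while the unstructured tool has the model perceive an application's interface and emit UI actions.

```python
# Generic, hypothetical examples of the two tool categories; not NVIDIA schemas.

# Structured tool: a typed, machine-parseable interface (CLI, API, database query).
structured_tool = {
    "name": "query_orders_db",
    "description": "Run a read-only SQL query against the orders database.",
    "parameters": {
        "type": "object",
        "properties": {
            "sql": {"type": "string", "description": "A single SELECT statement."},
            "row_limit": {"type": "integer", "default": 100},
        },
        "required": ["sql"],
    },
}

# Unstructured tool: the agent drives an existing application through its UI,
# using multimodal perception of the screen plus pointer/keyboard actions.
unstructured_tool = {
    "name": "operate_web_app",
    "description": "Observe a screenshot of the browser and issue an interaction.",
    "parameters": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["click", "type", "scroll"]},
            "target": {"type": "string", "description": "On-screen element, described in natural language."},
            "text": {"type": "string", "description": "Text to enter when action is 'type'."},
        },
        "required": ["action", "target"],
    },
}
```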
**Tight Supply-Demand Balance: Strain on Both Power and Chip Capacity** Addressing the market's ongoing concerns about AI computing supply, Huang offered one of his most direct assessments to date: both power and chip manufacturing capacity are in a tight balance, with no immediate potential for doubling supply in either area.
"I don't believe we have twice the power we need, nor do I believe we have twice the chip supply we need; there is no two-fold redundancy in any aspect," he stated. "But based on the visibility I have today, our supply chain is capable of supporting the demand."
He mentioned that NVIDIA works with roughly two hundred long-term supply chain partners and has planned well in advance across upstream and downstream segments, expressing optimism about its ability to support significant growth this year and next.
However, he admitted that the biggest bottleneck currently might not be the chips themselves, but rather the speed of deploying data center land, power, and construction. "What I probably wish for most is that this infrastructure could be completed faster," he added.
When asked whether NVIDIA is the primary beneficiary of the computing power scarcity, Huang acknowledged that the company is the largest player and has the most prepared supply chain, but attributed this to long-term planning rather than an accidental windfall from the market structure.