NVIDIA Diversifies Product Lines to Address Both AI Training and Inference Demands as CSPs Expand Custom ASIC Development

Stock News
Yesterday

According to the latest AI server research from TrendForce, as major cloud service providers (CSPs) intensify their efforts to develop custom chips, NVIDIA shifted its focus at GTC 2026 toward deploying AI inference applications across various sectors, a departure from its previous concentration on the cloud AI training market. NVIDIA is adapting its strategy by promoting a diversified portfolio of GPUs, CPUs, and LPUs to address AI training and inference demands separately, and by driving supply chain growth through integrated rack solutions. TrendForce notes that as custom chip initiatives led by CSPs such as Google and Amazon expand, the share of ASIC AI servers in total AI server shipments is projected to rise from 27.8% in 2026 to nearly 40% by 2030.

To reinforce its leadership in the AI market, NVIDIA is actively promoting integrated rack-scale solutions such as the GB300 and VR200, which combine CPUs and GPUs, emphasizing their scalability for AI inference applications. The Vera Rubin system introduced at GTC is positioned as a highly vertically integrated complete system, encompassing seven chip types and five rack configurations. On the Vera Rubin supply chain timeline, memory manufacturers are expected to supply HBM4 for the Rubin GPU by the second quarter of 2026, supporting NVIDIA's planned shipment of Rubin chips around the third quarter.

Among the GB300 and VR200 rack systems, the GB300 replaced the GB200 as the primary product in the fourth quarter of 2025, and its shipment share is estimated to reach nearly 80% in 2026. The VR200 rack is expected to ramp up shipment volumes gradually toward the end of the third quarter of 2026, though the subsequent pace will depend on actual progress at ODMs.

Furthermore, as AI transitions from generative to agent-based models, the decoding phase of token generation faces significant latency and memory-bandwidth bottlenecks. In response, NVIDIA, integrating technology from the Groq team, has introduced the Groq 3 LPU, designed specifically for low-latency inference. A single Groq 3 chip incorporates 500MB of SRAM, and a full rack supports up to 128GB. However, the LPU's limited memory capacity cannot hold the massive model parameters and KV caches that a system like Vera Rubin handles. Consequently, NVIDIA proposed a "Disaggregated Inference" architecture at GTC. Using an AI factory operating system called Dynamo, this architecture splits the inference pipeline into two parts: for agent-based AI, the prefill and attention computation stages, which involve heavy mathematical operations and store large KV caches, run on the high-throughput, large-memory Vera Rubin system, while the decoding and token-generation stages, which are bandwidth-constrained and highly latency-sensitive, are offloaded to LPU racks expanded with massive memory.
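The split described above can be sketched in a few lines of code. This is a purely illustrative toy, not NVIDIA Dynamo's actual API: all class and function names are invented, and the "model" is a stand-in arithmetic rule. It shows only the handoff pattern, where a prefill worker processes the whole prompt once and materializes a KV cache, and a separate decode worker then reads that cache at every step to emit one token at a time.

```python
# Hypothetical sketch of disaggregated inference (names are illustrative,
# not NVIDIA Dynamo's real interface).
from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Per-request key/value cache handed off from prefill to decode."""
    tokens: list = field(default_factory=list)

class PrefillWorker:
    """Stands in for the high-throughput, large-memory side (a GPU rack)."""
    def prefill(self, prompt):
        # Process the entire prompt in one compute-heavy pass and
        # materialize the KV cache that decoding will reuse.
        return KVCache(tokens=list(prompt))

class DecodeWorker:
    """Stands in for the low-latency side (an LPU rack with fast SRAM)."""
    def decode(self, cache, steps):
        out = []
        for _ in range(steps):
            # Each step rereads the full cache (this is why decoding is
            # bandwidth-bound) to produce a single next token.
            nxt = sum(cache.tokens) % 100  # toy deterministic "model"
            cache.tokens.append(nxt)
            out.append(nxt)
        return out

# Pipeline: prefill once, then stream tokens from the decode worker.
prompt = [3, 1, 4, 1, 5]
cache = PrefillWorker().prefill(prompt)
tokens = DecodeWorker().decode(cache, steps=3)
print(tokens)  # [14, 28, 56]
```

The point of the separation is that each stage can run on hardware matched to its bottleneck: prefill is compute- and capacity-bound, while decode is dominated by how fast the cache can be read back per token.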

On the supply chain front, the third-generation Groq LP30, manufactured by Samsung, has entered full mass production and is scheduled to begin official shipments in the second half of 2026. Future plans include a higher-performance LP40 chip on the next-generation Feynman architecture.

Disclaimer: Investing involves risk, and this article is not investment advice. Nothing above should be construed as an offer, recommendation, or solicitation to buy or sell any financial product, nor should any related discussions, comments, or posts by the author or other users. This article is for general reference only and does not take into account your personal investment objectives, financial situation, or needs. TTM assumes no responsibility or guarantee for the accuracy or completeness of the information; investors should conduct their own research and seek professional advice before investing.
