NVIDIA's GTC 2026 Reveal: Three New Systems Redefining AI Infrastructure Limits

Deep News
03/24

At the GTC 2026 conference, NVIDIA introduced three new systems in one sweeping move: the Groq LPX inference rack, the Vera ETL256 CPU rack, and the STX storage reference architecture. This expansion signifies NVIDIA's strategic push to extend its product portfolio beyond GPU computing cores into low-latency inference, CPU orchestration, and the storage layer, systematically redefining the boundaries of AI infrastructure.

The Groq LPX system garnered the most market attention. It is NVIDIA's first product launch since completing a $20 billion intellectual property licensing agreement and acquiring Groq's core team, and it arrived less than four months after that deal.

The LPX rack deeply integrates Groq's LP30 chip with NVIDIA GPUs and introduces "Attention FFN Disaggregation" (AFD) technology. This technology specifically targets the compression of decoding latency in high-interaction inference scenarios, creating a previously unavailable optimization path for large-scale inference systems.

The Vera ETL256, meanwhile, packs 256 CPUs into a single liquid-cooled rack and connects all of them internally over a copper cable topology. It directly addresses the CPU bottleneck that is becoming more prominent as AI workloads scale. STX, a standardized storage reference architecture, formally extends NVIDIA's influence from the compute and network layers into the storage infrastructure layer.

Analysis indicates that these three systems collectively signal a unified strategic direction: NVIDIA is evolving beyond being merely a GPU supplier into a full-stack AI infrastructure platform provider. Its reach now covers areas previously dominated by other vendors, including inference optimization, CPU density, and storage orchestration, which is set to profoundly impact the competitive landscape of the entire AI hardware supply chain.

**LPX and LP30: Groq Architecture Officially Integrated into NVIDIA's Inference Stack** The transaction between NVIDIA and Groq was structured as an IP licensing and talent acquisition deal rather than a traditional merger. This allowed NVIDIA to gain almost immediate access to Groq's entire IP portfolio and core team, leading to the launch of the LP30 chip and LPX rack system based on Groq's third-generation LPU architecture in under four months.

The LP30 chip is manufactured using Samsung's SF4 process, features 500MB of on-chip SRAM, and delivers 1.2 PFLOPS at FP8 precision. This represents a significant leap from Groq's first-generation LPU (230MB SRAM, 750 TOPS INT8), with performance gains primarily driven by the process node migration from GF16 to SF4.
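As a rough illustration of the generational step, the quoted figures can be turned into simple ratios. This is only arithmetic on the numbers above; the two compute figures are quoted at different precisions (INT8 versus FP8), so the compute ratio is indicative rather than a like-for-like benchmark. The variable names are illustrative.

```python
# Figures as quoted in the article; names are illustrative only.
first_gen = {"sram_mb": 230, "peak_tera_ops": 750}   # quoted at INT8
lp30 = {"sram_mb": 500, "peak_tera_ops": 1200}       # 1.2 PFLOPS quoted at FP8

sram_ratio = lp30["sram_mb"] / first_gen["sram_mb"]
compute_ratio = lp30["peak_tera_ops"] / first_gen["peak_tera_ops"]

print(f"On-chip SRAM:  {sram_ratio:.2f}x")
print(f"Peak compute:  {compute_ratio:.2f}x (different precisions, indicative only)")
```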

The LP30 exists as a single monolithic die, eliminating the need for advanced packaging. Notably, using the SF4 process does not consume NVIDIA's scarce allocation of TSMC's N3 capacity nor utilize the equally constrained HBM resources. Therefore, the LPX system represents a genuine source of incremental production capacity and revenue, an advantage that competitors cannot easily replicate.

**The Core Value and Inherent Limitations of the LPU Architecture** The competitive strength of the LPU architecture lies in its high-bandwidth SRAM and deterministic pipeline execution mechanism. This gives it a first-token generation speed in single-user, low-latency scenarios that is difficult for GPUs to match. However, the trade-off for high-density SRAM is limited capacity. After loading model weights, very little space remains; as batch sizes increase, the KV Cache saturates rapidly, resulting in significantly lower overall throughput compared to GPUs.
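The saturation effect can be sketched with the standard KV Cache sizing formula: each cached token stores one key and one value vector per layer and per KV head. The model shape and the amount of SRAM left over after weights below are hypothetical values chosen for illustration; they do not come from NVIDIA or Groq.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Bytes of KV Cache one token occupies: a key and a value per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

# Hypothetical model shape with 8-bit cache entries (assumption, not from the article).
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=1)

# Hypothetical: 50 MB of the 500MB SRAM remains once weights are loaded.
free_sram_bytes = 50 * 10**6
total_tokens = free_sram_bytes // per_token

# The same token budget is shared by every sequence in the batch.
for batch_size in (1, 8, 64):
    print(f"batch={batch_size:3d}  context per sequence ~ {total_tokens // batch_size} tokens")
```

Under these assumptions the usable context per sequence shrinks in direct proportion to batch size, which is the capacity pressure described above.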

Analysis suggests that independently deployed LPU systems are not cost-effective for large-scale token services. However, they can command substantial premiums in scenarios extremely sensitive to latency, which forms the basis for the LPU's role in a disaggregated decoding system.

**AFD Technology: Defining Roles for GPUs and LPUs** AFD technology splits the attention computation and the feed-forward network (FFN) computation in large model inference onto different hardware types. Attention computation, which involves dynamic KV Cache loading, is naturally suited for GPUs. FFN computation, being stateless and statically schedulable, aligns well with the deterministic nature of the LPU architecture.
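The split can be illustrated with a toy single-head decoder step in plain Python. It is a conceptual sketch, not NVIDIA's implementation: `AttentionWorker` and `ffn` are hypothetical names, and the weights are made-up values. The point is that the attention side carries state that grows with every token, while the FFN side is a pure function of its input and fixed weights.

```python
import math

class AttentionWorker:
    """Stateful role (the GPU in the text): owns a KV Cache that grows per token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        self.keys.append(key)
        self.values.append(value)
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, past_key)) / scale
                  for past_key in self.keys]
        top = max(scores)                      # subtract max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [sum(w * past_value[d] for w, past_value in zip(weights, self.values)) / total
                for d in range(len(value))]

def ffn(x, w_in, w_out):
    """Stateless role (the LPU in the text): same input and weights, same output."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]

# Made-up weights: 2 inputs -> 3 hidden units -> 2 outputs.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
W_OUT = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]

attention = AttentionWorker()
first = attention.step([1.0, 0.0], [1.0, 0.0], [2.0, 3.0])
second = attention.step([0.0, 1.0], [0.0, 1.0], [4.0, 5.0])
out = ffn(second, W_IN, W_OUT)
print(len(attention.keys), out)
```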

Within this framework, the GPU focuses on attention computation, freeing up its HBM capacity entirely for KV Cache, thereby increasing the total number of tokens the system can handle concurrently. The LPU handles the FFN computation, leveraging its low-latency advantage. Communication of tokens between the GPU and LPU is managed via All-to-All collective operations, with ping-pong pipelining used to hide the communication latency.
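How ping-pong pipelining hides communication can be sketched with a small timing model. It assumes two micro-batches alternate between one attention device and one FFN device, with a fixed transfer cost in each direction; the function names and the time units are hypothetical, and real schedules will be more involved.

```python
def pingpong_time(layers, attn, ffn, comm, micro_batches=2):
    """Finish time when micro-batches alternate between the two devices."""
    gpu_free = 0.0                      # when the attention device is next idle
    lpu_free = 0.0                      # when the FFN device is next idle
    ready = [0.0] * micro_batches       # when each micro-batch is back at the GPU
    for _ in range(layers):
        for m in range(micro_batches):
            attn_end = max(gpu_free, ready[m]) + attn
            gpu_free = attn_end
            ffn_end = max(lpu_free, attn_end + comm) + ffn
            lpu_free = ffn_end
            ready[m] = ffn_end + comm   # result travels back for the next layer
    return max(ready)

def serial_time(layers, attn, ffn, comm, micro_batches=2):
    """Finish time with no overlap: every stage and transfer runs back to back."""
    return micro_batches * layers * (attn + ffn + 2 * comm)

# Illustrative units: attention 1.0, FFN 1.0, one-way transfer 0.5, four layers.
overlapped = pingpong_time(layers=4, attn=1.0, ffn=1.0, comm=0.5)
baseline = serial_time(layers=4, attn=1.0, ffn=1.0, comm=0.5)
print(overlapped, baseline)
```

In this toy setting, one micro-batch is in transit or in the FFN while the other occupies the attention device, so most of the transfer time disappears from the critical path.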

Furthermore, LPUs can also function within a speculative decoding framework. Deploying the draft model or multi-token prediction (MTP) layers onto LPUs can further reduce the latency cost per decoding step, typically increasing the number of output tokens per step by a factor of 1.5 to 2.
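For intuition on where a 1.5x to 2x figure can come from, the commonly used speculative decoding estimate is shown below. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and the acceptance rates and draft lengths are hypothetical rather than measured values for LPX.

```python
def expected_tokens_per_step(accept_rate, draft_len):
    """Expected tokens emitted per verification step.

    Sums accept_rate**i for i = 0..draft_len: the accepted draft tokens plus
    the one token the target model always contributes.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

for accept_rate, draft_len in [(0.5, 1), (0.8, 1), (0.5, 3)]:
    tokens = expected_tokens_per_step(accept_rate, draft_len)
    print(f"accept={accept_rate}, draft_len={draft_len}: {tokens:.3f} tokens/step")
```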

**LPX Rack Architecture** The LPX rack consists of 32 x 1U LPU compute trays and 2 Spectrum-X switches. Each compute tray houses 16 LP30 chips, 2 Altera FPGAs (referred to by NVIDIA as "Fabric Expansion Logic"), 1 Intel Granite Rapids host CPU, and 1 BlueField-4 front-end module.

The FPGAs perform several critical functions: converting the LPU's C2C protocol to Ethernet for connection to the Spectrum-X scale-out network, providing a PCIe bridge between the LPUs and the host CPU, and supplying up to 256GB of DDR5 expansion memory per chip for KV Cache storage. The total scale-out bandwidth for the entire rack is approximately 640 TB/s.

The LPU modules are mounted "belly-to-belly" on both sides of the PCB, 8 on top and 8 on bottom, a design aimed at shortening the X and Y direction trace lengths required for the full interconnect mesh. Within a node, the 16 LPUs are connected in a full-mesh topology. Inter-node connections are made via a copper cable backplane, while inter-rack connections are achieved through front-panel OSFP interfaces.
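The rack-level totals implied by these figures follow from simple multiplication. The sketch below only combines numbers quoted in the article (32 trays, 16 LP30 chips and 2 FPGAs per tray, 500MB SRAM and 1.2 PFLOPS FP8 per chip); the aggregate values are derived here, not stated by NVIDIA, and the peak compute is a theoretical sum.

```python
TRAYS_PER_RACK = 32
LP30_PER_TRAY = 16
FPGAS_PER_TRAY = 2
SRAM_MB_PER_LP30 = 500
FP8_PFLOPS_PER_LP30 = 1.2

def full_mesh_links(nodes):
    """Point-to-point links needed so every node connects directly to every other."""
    return nodes * (nodes - 1) // 2

lp30_per_rack = TRAYS_PER_RACK * LP30_PER_TRAY
fpgas_per_rack = TRAYS_PER_RACK * FPGAS_PER_TRAY
sram_gb_per_rack = lp30_per_rack * SRAM_MB_PER_LP30 / 1000
peak_pflops_per_rack = lp30_per_rack * FP8_PFLOPS_PER_LP30
mesh_links_per_tray = full_mesh_links(LP30_PER_TRAY)

print(f"LP30 chips per rack:      {lp30_per_rack}")
print(f"FPGAs per rack:           {fpgas_per_rack}")
print(f"Aggregate on-chip SRAM:   {sram_gb_per_rack:.0f} GB")
print(f"Aggregate peak FP8:       {peak_pflops_per_rack:.1f} PFLOPS")
print(f"Full-mesh links per tray: {mesh_links_per_tray}")
```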

**Vera ETL256: The Density Limit of 256 CPUs** As AI workloads place increasing demands on data preprocessing, scheduling orchestration, and reinforcement learning validation, CPUs are becoming a new bottleneck limiting GPU utilization. This is particularly pronounced in reinforcement learning scenarios, where CPUs need to run simulation environments in parallel, execute code, and validate outputs. The rate of GPU scale-out is far outpacing that of CPUs, necessitating ever-larger CPU clusters to keep GPUs fully utilized.

NVIDIA's solution is the Vera ETL256, which integrates 256 Vera CPUs into a single rack, relying on liquid cooling to achieve this density target.

The design logic follows that of the NVL compute rack: push computational density to the point where all connections within the rack can be covered by copper cabling, thereby completely eliminating the need for optical transceivers at the spine network level. The cost savings from using copper cables are sufficient to offset the additional expense of liquid cooling.

Specifically, the Vera ETL rack comprises 32 compute trays, each carrying 8 Vera CPUs. The trays are arranged symmetrically, 16 above and 16 below, around 4 x 1U MGX ETL switch trays (based on Spectrum-6) at the center. This symmetrical layout intentionally minimizes the variation in cable length between each compute tray and the central switch trays, ensuring all connections remain within the feasible range for copper cables.

The rear ports of each switch tray handle intra-rack copper spine communication, while the 32 front-facing OSFP interfaces provide fiber optic connectivity to the rest of the POD. The intra-rack network utilizes a Spectrum-X multi-plane topology, distributing 200 Gb/s channels across the four switches to achieve a full-mesh Ethernet connection among all 256 CPUs within a single network layer.
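The benefit of the symmetrical layout can be seen with a deliberately crude model that measures distance in tray positions from the switch block. The function name is hypothetical and real cable lengths depend on routing that the article does not describe; the tray and CPU counts are the article's.

```python
COMPUTE_TRAYS = 32
CPUS_PER_TRAY = 8

def farthest_tray_distance(trays_above, trays_below):
    """Tray positions between the switch block and the most distant compute tray."""
    return max(trays_above, trays_below)

total_cpus = COMPUTE_TRAYS * CPUS_PER_TRAY

# Switches in the middle (as described) versus a hypothetical end-of-rack placement.
centered = farthest_tray_distance(trays_above=16, trays_below=16)
end_of_rack = farthest_tray_distance(trays_above=0, trays_below=COMPUTE_TRAYS)

print(f"CPUs per rack: {total_cpus}")
print(f"Farthest tray, centered switches:    {centered} positions")
print(f"Farthest tray, end-of-rack switches: {end_of_rack} positions")
```

Under this model, centering the switches halves the worst-case reach, which is consistent with the stated goal of keeping every link within copper range.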

**STX: NVIDIA's Systematic Expansion into the Storage Layer** STX is a storage reference rack architecture launched by NVIDIA at GTC 2026. It complements the previously introduced CMX context storage platform, forming a complete strategy for NVIDIA's penetration into the storage infrastructure layer.

Building upon CMX, STX establishes a precise reference architecture that specifies the exact quantities of disk drives, Vera CPUs, BlueField-4 DPUs, CX-9 NICs, and Spectrum-X switches required in a cluster.

Each STX chassis contains 2 BlueField-4 units, which together provide 2 Vera CPUs, 4 CX-9 NICs, and 4 SOCAMM modules. A full STX rack holds 16 such chassis, for a total of 32 Vera CPUs, 64 CX-9 NICs, and 64 SOCAMM modules.
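The rack totals are the per-chassis counts scaled by 16, as the short check below shows. The per-rack BlueField-4 count is derived here from the same multiplication and is not stated in the article.

```python
CHASSIS_PER_RACK = 16
PER_CHASSIS = {
    "BlueField-4": 2,
    "Vera CPU": 2,
    "CX-9 NIC": 4,
    "SOCAMM module": 4,
}

# Scale every per-chassis count up to a full rack.
per_rack = {part: count * CHASSIS_PER_RACK for part, count in PER_CHASSIS.items()}

for part, count in per_rack.items():
    print(f"{part}: {count} per rack")
```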

Concurrently with the STX announcement, NVIDIA explicitly named several major storage vendors—including DDN, Dell Technologies, HPE, IBM, NetApp, Supermicro, and VAST Data—stating that these vendors will support the STX standard. This continues NVIDIA's established practice of using industry endorsements to strengthen the influence of its reference architectures.

Analysis suggests that the combination of BlueField-4, CMX, and STX represents NVIDIA's systematic advancement into the storage, software, and infrastructure operational layers, following its established dominance in the computing layer (GPUs) and network layer (Spectrum-X and NVLink).

Together, these three new systems significantly widen NVIDIA's product moat, indicating that a larger share of the AI infrastructure supply chain market will continue to consolidate around NVIDIA.
