Cerebras' IPO: Deeply Tied to OpenAI, Reshaping AI Chip Market Expectations with "Fast Tokens"

Deep News
May 14

Cerebras' narrative has suddenly become compelling. A few years ago, it was a radical AI hardware company using "an entire wafer as a chip," a bold technological approach with uncertain commercial prospects. Now, with fast inference becoming a priority for which large model developers are willing to pay a premium, and OpenAI signing a 750MW inference compute partnership, Cerebras finds itself on the cusp of an IPO window.

SemiAnalysis analyst Myron Xie succinctly captured the core shift in a research report on the 14th: "Beyond a certain intelligence threshold, developers prefer faster tokens over smarter tokens." This statement explains the pivot in Cerebras' valuation logic: it doesn't necessarily need to beat GPUs in all AI compute scenarios. As long as "high interaction speed" becomes a monetizable product, its wafer-scale architecture finds its purpose.

This is also Cerebras' most intriguing aspect. The WSE-3 packs 44GB of SRAM, compute cores, and on-wafer interconnect onto a single wafer, delivering memory bandwidth on the order of 21 PB/s and enabling inference speeds that traditional HBM-based accelerators struggle to reach. However, the same architecture imposes limitations: SRAM capacity is not vast, off-wafer I/O is limited to 150 GB/s, and cooling, power delivery, and packaging are highly customized, so serving very large models and long contexts becomes increasingly difficult.

OpenAI represents Cerebras' biggest opportunity, but also concentrates risk onto a single customer. The agreement corresponds to 750MW of inference compute, with OpenAI holding an option for an additional 1.25GW. Cerebras' disclosed remaining performance obligation stands at $24.6 billion. However, this deal is coupled with a $1 billion working capital loan, nearly free warrants, and intense data center delivery pressure. The key question for IPO investors is not "Are wafer-scale chips cool?" but rather: "Can the premium for fast tokens cover Cerebras' structural costs and single-customer risk?"

Cerebras is betting on "interaction speed," not "total throughput"

Historically, the focus for AI inference hardware has been on how many tokens each GPU or rack can produce. For cloud providers and model developers, total throughput translates to unit cost and the ability to serve more users.

However, user behavior is pushing another metric to the forefront: tokens/sec/user—the speed at which a single user receives output.

OpenAI, Anthropic, and others are segmenting the same model into different service tiers: fast, priority, standard, batch. Whether users are willing to pay for faster responses is no longer just product manager speculation. Opus 4.6 fast was once priced at about 6x for a 2.5x interaction speed advantage; later, the speed advantage was reduced to about 1.75x. Even so, the high-speed mode remains a SKU developers are willing to pay for. SemiAnalysis itself saw its AI spending annualize to $10 million in April, with 80% spent on Opus 4.6 fast.
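
As a rough illustration of the arithmetic behind those tiers, the sketch below divides the price multiple by the speed multiple to get an implied price per unit of speed. It assumes the roughly 6x price held steady when the speed advantage fell to about 1.75x, which the figures above do not state outright; the function name is ours.

```python
def price_per_unit_speed(price_multiple: float, speed_multiple: float) -> float:
    """Price multiple paid for each 1x of interaction speed versus the standard tier."""
    return price_multiple / speed_multiple


# Figures quoted above; holding the ~6x price constant is an assumption.
at_launch = price_per_unit_speed(6.0, 2.5)
later = price_per_unit_speed(6.0, 1.75)

print(f"at 2.5x speed:  {at_launch:.2f}x price per unit of speed")
print(f"at 1.75x speed: {later:.2f}x price per unit of speed")
```

If the price did hold, buyers ended up paying more per unit of speed after the change, which is consistent with the claim that the fast tier kept its appeal.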

This indicates a market shift: when model capability is sufficiently usable, wait time becomes a productivity bottleneck. For coding, tool calling, and iterative agentic workflows, a few seconds' delay isn't just a user experience issue; it disrupts the workflow.

This is precisely where Cerebras' advantage lies. Instead of stacking more HBM for capacity, it leverages the extremely high bandwidth of on-wafer SRAM to excel in decode scenarios characterized by low batch size, low concurrency, and high interaction speed. In other words, a GPU is like a bus that can carry many people, while Cerebras is more like a sports car designed for high-speed, direct transport for a few passengers.

WSE-3 is not a "big GPU"; it's an entire wafer

Cerebras' core product, the Wafer Scale Engine (WSE), treats an entire wafer as a single chip, rather than dicing it into dozens or hundreds of individual dies.

The WSE-3 is fabricated using TSMC's N5 process and consists of 12x7, or 84, identical stepping regions. Each wafer contains approximately 970,000 cores, with 900,000 enabled. Half the wafer area is dedicated to SRAM, the other half to compute cores. The key to this design is keeping both computation and memory on the same piece of silicon, minimizing the need for data to leave the chip or its package.

The specifications are striking:
- SRAM capacity: 44GB
- SRAM bandwidth: 21 PB/s
- Off-wafer I/O: 150 GB/s
- Marketing FP16 compute: 125 PFLOPs
- Dense FP16 compute (adjusted for 8:1 unstructured sparsity): ~15.6 PFLOPs

These numbers require separate interpretation. The 21 PB/s memory bandwidth is Cerebras' strongest point. The 15.6 PFLOPs of dense FP16 compute is also significant, but when measured per unit silicon area, it's not as astonishing as the marketing figure suggests. The 125 PFLOPs figure stems from a sparsity assumption; the materials jokingly refer to this calculation as "Feldman's Formula," which essentially multiplies dense compute by 8.

The real dividing line lies in memory type. Mainstream AI accelerators like GPUs, TPUs, and Trainium place model weights and KV Cache in HBM. Cerebras aims to keep them in SRAM as much as possible. SRAM is fast with low latency but has a high cost per bit and low capacity density.

44GB of SRAM is substantial in the single-chip world. Compared to HBM, however, it's modest. A single HBM3E 12-Hi stack offers 36GB; a current high-end GPU or TPU package typically incorporates 8 stacks, totaling 288GB—6.5 times the SRAM capacity of WSE-3.

This is Cerebras' fundamental trade-off: trading capacity for speed.

The wafer wins in low arithmetic intensity decode, loses in large models and long context

Cerebras is best suited for the decode phase, which is characterized by low arithmetic intensity and is memory bandwidth-bound.

In large model inference, many kernels are not compute-limited but memory bandwidth-limited. A GPU's Tensor Cores might be powerful, but if weights and KV Cache cannot be fed fast enough, the compute sits idle. By spreading massive SRAM across the wafer, keeping data close to compute units with ample bandwidth, Cerebras can achieve interaction speeds in low-concurrency decode scenarios (like batch=1) that are hard for traditional HBM systems to reach.

Theoretical comparisons in the materials are clear: for a decode kernel with batch=1 and arithmetic intensity around 2, NVIDIA GPUs and Groq LPUs might theoretically achieve tens to hundreds of TFLOPs; the Cerebras WSE-3, under ideal conditions, can approach its full 15.625 PFLOPs dense FP16 compute.
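
The comparison above follows from a simple roofline model, in which attainable compute is the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below is a minimal illustration, not the materials' own calculation: the WSE-3 figures come from the text, while the 8 TB/s and 2 PFLOPs assigned to the HBM accelerator are placeholder assumptions rather than quoted specifications.

```python
def attainable_flops(peak_flops: float, mem_bw_bytes_per_s: float, intensity: float) -> float:
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_flops, mem_bw_bytes_per_s * intensity)


PFLOP = 1e15
TB = 1e12
PB = 1e15

INTENSITY = 2.0  # FLOPs per byte, the batch=1 decode case above

# WSE-3: dense FP16 = 125 PFLOPs marketing figure / 8 (sparsity factor).
wse3_dense = 125 * PFLOP / 8
wse3 = attainable_flops(wse3_dense, 21 * PB, INTENSITY)

# Hypothetical HBM accelerator: 8 TB/s and 2 PFLOPs are placeholders, not quoted specs.
hbm = attainable_flops(2 * PFLOP, 8 * TB, INTENSITY)

print(f"WSE-3: {wse3 / PFLOP:.3f} PFLOPs (compute-bound)")
print(f"HBM accelerator: {hbm / 1e12:.0f} TFLOPs (bandwidth-bound)")
```

At an intensity of 2, the wafer's 21 PB/s could feed 42 PFLOPs, more than its dense peak, so it stays compute-bound. The placeholder HBM part is limited by bandwidth to tens of TFLOPs.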

This is the hardware foundation for "fast tokens."

However, as models grow larger and contexts longer, the 44GB SRAM becomes a constraint. An inference system's memory must hold three things:
1. Model weights.
2. KV Cache required for concurrent requests.
3. Larger KV Cache due to long context.

Workloads like agentic coding are particularly challenging. Sample analysis of ~432,000 requests (~80 billion tokens) shows a median (P50) input sequence length of ~96.3k tokens, not the 64k assumed in Cerebras' product hypothesis; nearly 50% of requests exceed 128k, the maximum context window currently supported by Cerebras' public endpoints.

This implies that as model services move towards 256k or 1M contexts, Cerebras must either compress KV Cache, use more wafers, or sacrifice interaction speed and economics.
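
To give a sense of scale, the sketch below estimates the KV Cache footprint of a single request at several context lengths. The model shape is an assumption chosen for illustration (a Llama-3-70B-style configuration with 80 layers, 8 KV heads, head dimension 128, and an uncompressed FP16 cache), not a model discussed in the materials, and `kv_bytes_per_token` is a hypothetical helper.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys and values (factor of 2) stored for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed Llama-3-70B-style shape: 80 layers, 8 KV heads (GQA), head dim 128, FP16.
PER_TOKEN = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
WSE3_SRAM_BYTES = 44e9  # 44GB, treated as decimal gigabytes

for label, tokens in [("64k", 64 * 1024), ("128k", 128 * 1024), ("1M", 1024 * 1024)]:
    kv_gb = PER_TOKEN * tokens / 1e9
    print(f"{label:>4} context: {kv_gb:6.1f} GB of KV cache "
          f"= {kv_gb * 1e9 / WSE3_SRAM_BYTES:.2f}x one wafer's SRAM")
```

Under these assumptions a single 128k-token request needs about 43 GB of KV Cache, nearly a whole wafer's SRAM before any weights are stored, which is why compression or additional wafers become unavoidable at longer contexts.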

Cooling and BOM illustrate: This is not cheap compute

The CS-3 system is not as simple as plugging a chip into a server.

Each CS-3 includes a WSE-3 engine block, peripheral compute and I/O modules, two mechanical pumps, twelve 3.3kW power modules, and a liquid cooling system. The WSE-3 wafer itself consumes about 25kW across its 46,225 square millimeter area, resulting in an average heat flux density of ~50W/cm², not accounting for hotspots.
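
The heat flux figure can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are ours, and the comparison of installed power-module capacity against wafer power is an added illustration rather than something the materials compute.

```python
WAFER_POWER_W = 25_000        # ~25kW, from the text
WAFER_AREA_MM2 = 46_225       # 215mm x 215mm
POWER_MODULES = 12
MODULE_RATING_W = 3_300

area_cm2 = WAFER_AREA_MM2 / 100          # 100 mm^2 per cm^2
heat_flux = WAFER_POWER_W / area_cm2     # W/cm^2, averaged over the wafer
supply_capacity_kw = POWER_MODULES * MODULE_RATING_W / 1000

print(f"average heat flux: {heat_flux:.1f} W/cm^2")
print(f"installed power-module capacity: {supply_capacity_kw:.1f} kW")
```

The result is about 54 W/cm², in line with the rounded ~50W/cm² above. The twelve modules add up to 39.6 kW; whether the margin over the wafer's 25kW covers redundancy or the rest of the system is not stated.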

Air cooling is impractical. A standard 3D vapor chamber scaled to 21.5cm square would hit capillary limits, where working fluid return cannot keep up. Cerebras had to develop a custom liquid cooling structure: a four-layer "sandwich" of cold plate, wafer, flexible connector, and PCB, with a cooling manifold attached behind the cold plate. The differing thermal expansion coefficients of silicon and PCB would cause traditional packaging to crack, necessitating custom connection, pre-load, and assembly tools.

Data center infrastructure is also impacted. The GB200 NVL72 reference design targets a facility-side flow rate of ~1.5 LPM/kW. The WSE-3 at 25kW requires ~100 LPM, equating to 4 LPM/kW—nearly three times higher. This demands larger pumps, wider pipes, bigger CDUs, and higher-flow quick disconnects. The CS-4 would need to bring rack-level flow back to 1.5–1.7 LPM/kW to better align with standardized infrastructure.
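
The flow comparison reduces to normalizing coolant flow by heat load. The sketch below reproduces it from the quoted figures. The final loop holds wafer power at 25kW to show what the 1.5–1.7 LPM/kW target would imply, which is a simplifying assumption given that the target is stated at rack level for the CS-4.

```python
def flow_per_kw(flow_lpm: float, power_kw: float) -> float:
    """Facility-side coolant flow normalized by heat load (LPM/kW)."""
    return flow_lpm / power_kw


wse3 = flow_per_kw(flow_lpm=100, power_kw=25)   # figures from the text
nvl72_reference = 1.5                            # LPM/kW, from the text

print(f"WSE-3: {wse3:.1f} LPM/kW, {wse3 / nvl72_reference:.2f}x the NVL72 reference")

# Assumption: the same 25kW heat load served at the 1.5-1.7 LPM/kW target range.
for target in (1.5, 1.7):
    print(f"at {target} LPM/kW: {target * 25:.1f} LPM")
```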

Costs are also substantial. BOM estimates for a CS-3 rack with KVSS CPU nodes were around $350,000 per rack before Q4 memory price hikes; incorporating recent memory prices pushes this to ~$450,000 per rack. The KVSS is a dual-socket AMD CPU node with 6TB of DDR5 RDIMM, used for KV Cache offload.

Interestingly, the most expensive component isn't just the TSMC N5 wafer. A nominal N5 wafer costs about $20,000, but Cerebras incurs additional costs for extra upper metal masks per wafer batch to bypass defective tiles. Custom Vicor power modules are also expensive, with their value estimated in materials as comparable to the TSMC content. Cooling, packaging, and assembly involve significant in-house development. Peripherally, twelve 100GbE Xilinx FPGAs act like NICs, converting Cerebras' proprietary I/O to Ethernet.

Thus, Cerebras is not a "cheap chip replacing GPUs." It is a complex system that trades cost and complexity for extreme interaction speed within a specific inference speed band.

SRAM scaling stagnation is an unavoidable node problem for Cerebras

Cerebras relies heavily on SRAM, but SRAM scaling is slowing down.

The SRAM capacity progression across three WSE generations is telling:
- WSE-1 (TSMC 16nm): 18GB SRAM.
- WSE-2 (7nm): 40GB SRAM, a 2.2x generational increase.
- WSE-3 (5nm): 44GB SRAM, only a ~10% increase.

While logic transistor count increased ~50% from 7nm to 5nm, SRAM capacity barely moved. The outlook is tougher. N3E offers minimal SRAM scaling over N5, and N2 and beyond continue to face limitations.

This is more critical for Cerebras than for GPU vendors. GPUs can continue stacking HBM, expanding packages, and pooling memory via interconnects. SRAM-based machines like Groq can use hybrid bonding to stack more SRAM tiles vertically. Cerebras uses an entire wafer; its planar area is already maximized. Increasing SRAM area would mean sacrificing compute area.

The CS-4 roadmap reveals this: it still uses an N5-based WSE-3 but increases power, clock speed, and compute sustainability, while SRAM capacity remains unchanged.

A potential direction is wafer-to-wafer hybrid bonding, stacking a DRAM wafer or more memory onto the WSE. Cerebras is indeed exploring this path. However, the thermomechanical challenges and bond wave issues for wafer-scale monolithic chips are more difficult than for conventional hybrid bonding. The company has solved many unconventional problems before, but the next step remains a formidable challenge.

The biggest weakness is I/O: A large wafer with a narrow exit

The WSE-3's off-wafer bandwidth is only 150 GB/s, or 1.2 Tb/s. Relative to its compute scale and on-wafer bandwidth, this exit is too narrow.

This issue isn't due to engineers overlooking I/O importance; it's a geometric constraint inherent to the wafer-scale architecture.

The WSE consists of 84 identical stepping regions. Each reticle exposure pattern must be identical, with logic, SRAM, and routing in the same locations, to allow interconnects to extend continuously across scribe lines. This means you cannot place SerDes PHYs only on the edge reticles while filling the center with compute. Every reticle must be identical.

To increase edge I/O, PHYs would need to be placed in every reticle. The problem is that PHYs in the center have no way to connect to the external world, becoming wasted silicon area. Worse, high-speed SerDes PHYs are area-intensive, their analog circuits dislike proximity to digital logic, and they require guard regions. Placing them inside the wafer would create holes in the 2D mesh, increasing routing complexity and latency, undermining the very problem wafer-scale interconnect aims to solve.

The materials provide a telling figure: WSE's current off-wafer bandwidth density is ~0.17 GB/s per mm of edge. NVIDIA's off-package I/O density is approximately 130 times higher.
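
The edge density figure can be reconstructed from numbers given earlier, assuming "edge" means the full perimeter of the 46,225 square millimeter square die. That reading is our assumption, though it reproduces the quoted ~0.17 GB/s per mm.

```python
import math

WAFER_AREA_MM2 = 46_225
OFF_WAFER_GB_S = 150

side_mm = math.sqrt(WAFER_AREA_MM2)      # square wafer die: 215 mm per side
perimeter_mm = 4 * side_mm
density = OFF_WAFER_GB_S / perimeter_mm  # GB/s per mm of edge

print(f"perimeter: {perimeter_mm:.0f} mm")
print(f"edge I/O density: {density:.2f} GB/s per mm")
print(f"implied by the ~130x ratio: {density * 130:.0f} GB/s per mm")
```

The last line simply scales the WSE figure by the quoted ratio to show the order of magnitude implied for NVIDIA; it is not an independently sourced specification.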

Cerebras' proposed solution is optical interconnect wafers: using hybrid bonding to stack a photonic interconnect wafer onto the WSE, allowing data to travel vertically (Z-axis) rather than squeezing out from the wafer edge. The partner for this is Ranovus.

This path is elegant but challenging. Optical components are temperature-sensitive, requiring a specific thermal environment, and they must be attached to a high-power wafer. Fiber coupling in standard CPO (Co-Packaged Optics) isn't yet fully engineered for easy mass production, let alone scaled to an entire wafer.

Large models will force Cerebras into pipelining, contradicting the "fast" premise

If a model doesn't fit on a single WSE, it must be partitioned across multiple wafers.

However, the low I/O bandwidth rules out many common parallelization methods. High-bandwidth collective communication is impractical, as is frequently shuttling large tensors on and off the wafer. The most feasible remaining option is pipeline parallelism: splitting the model layer-wise across multiple WSEs, with each wafer holding the weights for its assigned layers, passing only activations between stages.

When serving Llama 3 70B, Cerebras splits the model across 4 WSE-3s, passing only activations between wafers, keeping communication within the 1.2 Tb/s I/O capability.

But pipelining introduces three problems:
1. Pipeline bubbles. With 4 stages, at least ~4 in-flight microbatches are needed to keep the pipeline busy; 16 stages require ~16. More stages increase scheduling difficulty.
2. Each in-flight microbatch has its own KV Cache, which must also fit within the 44GB SRAM alongside weights. Even with stronger KV compression in newer models, moving KV Cache on and off the wafer adds millisecond-level latency, increasing TTFT (Time To First Token) and TPOT (Time Per Output Token) pressure.
3. As the number of wafers increases, the fixed latency for activations traveling between wafers also increases linearly.

Larger models deviate further from Cerebras' ideal scenario: low-batch, low-latency, high-speed decode on a single or few wafers.
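
A back-of-the-envelope budget for the 4-wafer Llama 3 70B split described above shows how tight the SRAM becomes. This is a minimal sketch under stated assumptions: 16-bit weights, an uncompressed FP16 KV Cache, the public Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128), layers split evenly across wafers, and no allowance for activations or other buffers. None of these details are confirmed by the materials.

```python
import math

SRAM_BYTES = 44e9            # per WSE-3, from the text
PARAMS = 70e9                # Llama 3 70B
BYTES_PER_WEIGHT = 2         # assumption: 16-bit weights
STAGES = 4                   # wafers, from the text

# Assumed Llama-3-70B shape: 80 layers, 8 KV heads, head dim 128, FP16 KV cache.
LAYERS, KV_HEADS, HEAD_DIM, KV_BYTES = 80, 8, 128, 2

weights_per_stage = PARAMS * BYTES_PER_WEIGHT / STAGES
kv_budget = SRAM_BYTES - weights_per_stage  # ignores activations and other buffers

layers_per_stage = LAYERS // STAGES
kv_per_token = 2 * layers_per_stage * KV_HEADS * HEAD_DIM * KV_BYTES  # K and V

in_flight = STAGES           # ~one microbatch per stage keeps the pipeline busy
tokens_per_microbatch = math.floor(kv_budget / kv_per_token / in_flight)

print(f"weights per wafer: {weights_per_stage / 1e9:.0f} GB")
print(f"KV budget per wafer: {kv_budget / 1e9:.0f} GB")
print(f"KV tokens per in-flight microbatch: {tokens_per_microbatch:,}")
```

Under these assumptions about 35 GB of each wafer goes to weights, leaving roughly 9 GB for KV Cache, or about 27,000 tokens of context per in-flight microbatch. That is well short of the context lengths seen in agentic coding workloads.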

The public product line also hints at boundaries. The Cerebras Inference Cloud's largest production model is GPT-OSS at 120B total parameters; the larger preview model GLM 4.7 goes up to 355B. Llama 70B and 405B were once available but were later taken down, possibly due to service economics. The hot open-source frontier models of 2025, DeepSeek V3 and Kimi K2, do not appear on Cerebras' public cloud.

This isn't an absolute dead end. Models like DeepSeek V4 Pro, with stronger KV Cache compression, could make 1T+ models serviceable again under sufficient concurrency. The question is whether this can be done while preserving Cerebras' most valuable asset: speed.

OpenAI brings Cerebras to the main table, but also concentrates risk

OpenAI is not an ordinary customer in Cerebras' future.

In December 2025, the two parties signed a Master Relationship Agreement (MRA). OpenAI committed to purchasing 750MW of AI inference compute, to be deployed in batches from 2026 to 2028, with each batch having a 3–4 year term, extendable to 5 years. OpenAI also holds an option to purchase an additional 1.25GW, bringing the total potential to 2GW.

The S-1 filing discloses that as of December 31, 2025, Cerebras' remaining performance obligation is $24.6 billion. More importantly, pass-through costs like data center rent, power, leasehold improvements, and security are reimbursed by OpenAI and recognized as revenue.

OpenAI also provided a $1 billion working capital loan at a 6% annual interest rate. Interest can be waived if Cerebras repays through compute or hardware delivery. Repayment begins after the final delivery of the initial 250MW batch, amortized equally over three years. If the MRA terminates for reasons other than a material uncured breach by OpenAI, Cerebras may have to immediately repay all outstanding principal and accrued interest. OpenAI can also instruct the escrow bank to stop disbursing funds per Cerebras' instructions and take direct control of the funds.

The equity tie is also deep. Cerebras issued OpenAI warrants for 33,445,026 shares of Class N non-voting common stock with an exercise price of $0.00001, essentially free. A portion vested immediately due to the $1 billion loan; another portion is tied to a $40 billion valuation or payment milestone; the remainder is linked to compute delivery and the option to expand to 2GW. On a fully diluted basis, OpenAI could hold up to ~12% of Cerebras, excluding subsequent issuances.

Under ASC 505-50, equity incentives granted to a customer are recognized as contra-revenue over the commercial agreement term. A rough calculation based on the S-1's $82.02 per share valuation suggests the warrants theoretically correspond to ~$2.74 billion in contra-revenue, roughly 10% of the revenue expected from OpenAI.
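
The rough calculation can be reproduced as below. The share count and per-share valuation come from the text. Using the $24.6 billion remaining performance obligation as a stand-in for the revenue expected from OpenAI is our assumption, so the resulting percentage is only indicative.

```python
WARRANT_SHARES = 33_445_026
PRICE_PER_SHARE = 82.02        # S-1 valuation quoted above
RPO = 24.6e9                   # remaining performance obligation, used as a revenue proxy

warrant_value = WARRANT_SHARES * PRICE_PER_SHARE  # exercise price of $0.00001 ignored
share_of_rpo = warrant_value / RPO

print(f"warrant value: ${warrant_value / 1e9:.2f} billion")
print(f"as a share of the $24.6 billion RPO: {share_of_rpo:.1%}")
```

Against the RPO alone the ratio comes out slightly above 11%. The roughly 10% quoted above would correspond to a somewhat larger expected-revenue base.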

This is an order that can change Cerebras' destiny, but it's also a structure that ties the company's fate to a single counterparty.

GPT-5.3-Codex-Spark proves speed's value but also exposes model size limitations

With the release of GPT-5.3-Codex-Spark, Cerebras' narrative becomes more complete. This model uses the gpt-oss-120B architecture, distilled from the actual GPT-5.3-Codex, and can run on Cerebras at up to 2000 tok/sec/user.

The key is "120B." It is not the full GPT-5.3-Codex but a much smaller distilled version. The materials explicitly state it is over 10x smaller than the full model.

This is both good news and a limitation for Cerebras.

The good news is that a 120B-scale model, if sufficiently capable and paired with extremely fast output speed, could indeed become a high-value product. Developers have already shown willingness to trade some cutting-edge intelligence for faster tokens.

The limitation is that if OpenAI wants Cerebras to run a 1T+ parameter model with a 1M context window for real agentic workloads, it must accept significant cost trade-offs, and the actual interaction speed might fall below 1000 tok/sec. The ability to command a sufficiently high token premium is crucial for the business model's viability.

The path outlined in the materials is aggressive: smaller models continue to improve in capability, with the 120B form factor potentially approaching GPT-5.5-level intelligence within about a year. If this holds, Cerebras wouldn't need to host the most cutting-edge, largest-parameter models to sell expensive fast tokens. OpenAI's locked 750MW is just the first step; the real upside lies in whether it exercises the additional 1.25GW option or expands purchases further.

But this upside condition is narrow: Cerebras must prove it can consistently host models within its hardware's suitable size range that are smart enough and profitable enough.

The core IPO question: Can the fast token premium long-term cover the hardware trade-offs?

Cerebras is not another GPU story. It's not aiming to replace NVIDIA comprehensively in training, large model general inference, or long-context throughput. Instead, it's placing a heavy bet on a narrower but potentially lucrative segment: high-interaction-speed, low-batch inference where users are willing to pay a premium.

The wafer-scale architecture gives it exceptional bandwidth and fast decode but saddles it with hard constraints: SRAM capacity, off-wafer I/O, cooling, BOM, and data center adaptation. The OpenAI order addresses the demand problem but does not eliminate delivery risk and customer concentration.

Therefore, Cerebras' IPO valuation should not be based solely on the $24.6 billion backlog or impressive speeds like 2000 tok/sec/user. More critical are three questions:
1. Will OpenAI's need for fast tokens long-term be satisfied by models in the 120B–355B range?
2. Can the premium users are willing to pay for speed cover Cerebras' more complex system costs?
3. Can the 750MW deployment proceed on schedule through 2028 without being hampered by cooling, power, supply chain, and data center capacity issues?

If the answers lean "yes," Cerebras could become one of the most distinctive AI hardware companies of the fast-inference era. If the answers lean "no," the speed advantage conferred by the entire wafer may be gradually eroded by the memory demands of ever-larger models and longer contexts.

Disclaimer: Investing carries risk. This is not financial advice. The above content should not be regarded as an offer, recommendation, or solicitation on acquiring or disposing of any financial products, any associated discussions, comments, or posts by author or other users should not be considered as such either. It is solely for general information purpose only, which does not consider your own investment objectives, financial situations or needs. TTM assumes no responsibility or warranty for the accuracy and completeness of the information, investors should do their own research and may seek professional advice before investing.
