Yang Zhilin Responds: Kimi K2 Trained on H800 GPUs – But "Only Cost $4.6M"?

Deep News
2025/11/11

The claim that Kimi K2 Thinking was trained for just $4.6 million has sparked discussions. Yang Zhilin, co-founder of Moonshot AI, clarified that this figure is unofficial, as training costs are difficult to quantify due to significant research and experimentation expenses.

The team revealed they trained on NVIDIA H800 GPUs connected via InfiniBand, operating far fewer GPUs than industry giants while squeezing maximum efficiency out of each card. Even with that constrained setup, Kimi K2's performance and cost-effectiveness are driving a migration wave in Silicon Valley.

Investor Chamath Palihapitiya shared that his new company shifted AI workloads to Kimi K2 for its superior speed and affordability. Vercel's CEO cited internal tests showing Kimi K2 running 5x faster and 50% more accurately than the closed-source models they compared it against. Users of Claude Code are also switching their configurations over to Kimi K2.

Comparisons to DeepSeek V3’s reported $5.6M training cost have fueled debates: Are closed-source giants’ valuations justified when open-source alternatives deliver equal or better performance at lower costs? Some argue Moonshot AI itself deserves reevaluation.

**How Kimi K2 Achieved Efficiency**

Technical analyses highlight Kimi K2's optimization of open-source foundations, particularly its architectural similarities to DeepSeek. Key tweaks include:

- Expanding the MoE expert pool from 256 to 384 for greater knowledge capacity.
- Reducing activated parameters per inference from ~37B to 32B to cut compute costs.
- Enlarging the vocabulary to 160k tokens and trimming dense feedforward blocks for computational efficiency.
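The trade-off behind these tweaks can be sketched with simple arithmetic: in a mixture-of-experts model, total parameter count grows with the expert pool, but per-token compute depends only on the few experts the router activates. The sketch below is illustrative only; the expert size and shared-parameter figures are assumptions, not published Kimi K2 hyperparameters beyond the counts cited above (384 experts, ~32B activated).

```python
# Illustrative MoE sizing arithmetic (not Moonshot's actual code).
# Assumed numbers: ~2.6B params per expert, ~11B shared (attention,
# embeddings, dense blocks) -- hypothetical values for illustration.

def moe_stats(total_experts: int, active_experts: int,
              params_per_expert_b: float, shared_params_b: float):
    """Return (total, activated) parameter counts in billions."""
    total = shared_params_b + total_experts * params_per_expert_b
    activated = shared_params_b + active_experts * params_per_expert_b
    return total, activated

# Growing the expert pool (256 -> 384) raises total capacity,
# while per-token cost is fixed by the experts actually routed to.
small = moe_stats(total_experts=256, active_experts=8,
                  params_per_expert_b=2.6, shared_params_b=11.0)
large = moe_stats(total_experts=384, active_experts=8,
                  params_per_expert_b=2.6, shared_params_b=11.0)
print(small, large)
```

Under these assumed numbers, the larger pool grows total capacity by roughly 50% while activated parameters per token stay unchanged, which is the lever the architecture tweaks exploit.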

Engineering innovations played a pivotal role:

- The proprietary *MuonClip* optimizer kept gradients stable, achieving "zero training crashes" across 15.5T tokens without manual intervention.
- Quantization-aware training (QAT) enabled native INT4 inference, roughly doubling speed with minimal performance loss.
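The core of quantization-aware training is simulating low-precision arithmetic during training so the network learns weights that survive rounding. The sketch below shows a generic symmetric INT4 fake-quantization step — a textbook illustration of the technique, not Moonshot's implementation:

```python
# Generic QAT building block: quantize-dequantize ("fake quant") a weight
# tensor to INT4 levels so training sees the rounding error it will face
# at inference time. Illustrative only, not Kimi K2's actual code.

import numpy as np

def fake_quant_int4(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor INT4 quantize-dequantize."""
    qmax = 7  # symmetric signed 4-bit range: integers in [-7, 7]
    scale = np.max(np.abs(w)) / qmax
    if scale == 0.0:
        return w
    q = np.clip(np.round(w / scale), -qmax, qmax)  # integer grid
    return q * scale                               # back to float

w = np.array([0.01, -0.5, 0.73, 1.2])
wq = fake_quant_int4(w)
# Each dequantized value lies within half a quantization step of the original.
```

In a real QAT setup this forward-pass rounding is paired with a straight-through estimator so gradients flow past the non-differentiable `round`; having trained under this constraint, the model can be served natively in INT4 without the accuracy drop of post-training quantization.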

**Moonshot AI’s Reddit AMA Highlights**

In a 3-hour Reddit AMA on r/LocalLLaMA, co-founders Yang Zhilin, Zhou Xinyu, and Wu Yuxin addressed ~200 questions:

- **Next-gen architecture (K3):** An experimental hybrid mechanism, KDA (Kimi Delta Attention), outperformed full attention with RoPE in both speed and benchmarks, and may debut in K3.
- **Roadmap:** A Claude Code-like *Kimi Code* is underway; vision-language models are in development but delayed by data challenges. Longer context windows may return after cost optimizations.
- **K2’s quirks:** The team acknowledged its "overthinking" inefficiency, pledging to streamline reasoning in future versions.

When asked about Kimi’s refusal to over-praise users, they attributed it to deliberate dataset design. Its distinctive writing style stems from pretraining knowledge and post-training "taste" tuning via RL.

As for K3’s release? The team joked, "Stay tuned."

[1] Reddit AMA link omitted per guidelines.
