NVIDIA's Jim Fan: Robotics Sector in Chaotic State, Even Development Direction May Be Wrong

Jim Fan, head of NVIDIA's robotics business and co-head of the GEAR lab, recently published a lengthy post on social media sharply criticizing the current state of the robotics industry. He argues that despite significant progress in hardware, the industry remains in disarray when it comes to software iteration, standard setting, and the choice of technical pathway.

Jim Fan pointed out that the current mainstream Vision-Language-Action (VLA) model technical pathway "feels wrong," as its pre-training approach based on Vision-Language Models (VLM) is fundamentally misaligned with the actual needs of robotics. He indicated he is betting on video world models as an alternative solution.

The post has drawn attention within the industry. While other areas of artificial intelligence are advancing rapidly, these fundamental problems suggest that robotics is still far from commercial deployment, which could affect investors' valuation expectations for related companies.

Jim Fan summarized three lessons from robotics in 2025, covering hardware reliability, industry standards, and technical pathways. Together they offer a frontline view of the bottlenecks currently holding the industry back.

Hardware reliability has become the biggest obstacle to software iteration. Jim Fan noted that although robots such as Optimus, e-Atlas, Figure, Neo, and G1 are sophisticated pieces of engineering, their reliability severely limits the pace of software development. He stated that the most advanced AI has yet to make full use of this cutting-edge hardware, suggesting that "the body's capabilities exceed the brain's command capabilities."

Unlike humans, robots cannot heal after being damaged. Overheating, motor failures, and firmware anomalies occur daily, and mistakes are irreversible and unforgiving. Keeping these robots running requires an entire operations team.

Jim Fan lamented, "The only thing that scales with size is my patience." The remark points to the high labor costs and slow iteration that characterize robotics research and development.

The lack of industry standards leads to a chaotic evaluation system. Jim Fan described the state of benchmarking in robotics as an "epic disaster." He pointed out that unlike the large language model field, which has established consensus standards like MMLU and SWE-Bench, the robotics industry lacks unified standards for hardware platforms, task definitions, scoring criteria, and simulator or real-world setups.

It is common for a company to define its own benchmark when issuing a press release and then claim "state-of-the-art" (SOTA) performance on it. Worse, demonstration videos are often the best result selected from hundreds of attempts.
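To see why a best-of-many demo says little about reliability, consider a minimal sketch. It assumes every attempt succeeds independently with the same probability `p`, which is an illustrative simplification and not a figure from Jim Fan's post. The function name and the example values are hypothetical.

```python
def prob_at_least_one_success(p: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent trials succeeds.

    Assumption (illustrative only): each trial succeeds with the same
    probability `p`, independently of the others.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # Complement rule: 1 minus the chance that every attempt fails.
    return 1.0 - (1.0 - p) ** attempts


if __name__ == "__main__":
    # Hypothetical per-attempt success rates, chosen only for illustration.
    for p in (0.01, 0.05, 0.20):
        chance = prob_at_least_one_success(p, attempts=100)
        print(f"per-attempt success {p:.0%}: chance of one clean take in 100 = {chance:.1%}")
```

Under these assumptions, a policy that succeeds only 1% of the time still has a better-than-even chance of producing one clean take in 100 attempts, so a single polished video cannot distinguish a reliable system from an unreliable one.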

Jim Fan urged, "In 2026 we must do better and stop treating reproducibility and scientific discipline as second-class citizens." This criticism directly targets the fundamental problem of the industry's lack of scientific rigor.

The mainstream technical pathway is being called into question. Jim Fan raised basic doubts about the currently dominant VLA model. VLA models are commonly built by grafting an action module onto a pre-trained vision-language model, but he sees two core problems with this approach.

First, most parameters in a VLM serve language and knowledge, not physics. Second, to achieve high-level understanding, the visual encoder actively discards low-level details, yet these minute details are crucial for a robot's dexterous manipulation.

Jim Fan argues that VLMs are highly optimized for benchmarks like visual question answering, and their pre-training objectives are misaligned with robotics needs. He stated there is "no reason to believe VLA performance will scale with VLM parameters." He indicated he is betting on video world models as a more suitable pre-training objective for robot policies.

Jim Fan's views have sparked discussion within the industry. A user named Stewart Alsop asked why models such as Helix, GR00T N1, and π0, which represent actual delivered results, are still built on VLMs if video world models are superior. He noted that world models are currently used mainly for policy evaluation and synthetic data rather than direct motion control.

Jim Fan responded that these are 2025 models, and he is looking forward to the next generation of large models in 2026.

