Beyond Ten-Thousand GPU Clusters: What Defines the Next Milestone in AI Infrastructure?

Deep News
Mar 20

In the intensely competitive global AI race, the contest over AI infrastructure has become a central focus alongside the development of large AI models. Gartner forecasts that global AI spending will surge to $2.52 trillion by 2026, with infrastructure investment growing at a rapid 49% rate. As AI models scale to trillions of parameters and computing clusters expand to ten-thousand or even hundred-thousand GPU levels, a critical gap has emerged between users' demand for computing power and the low model FLOPs utilization (MFU) that AI clusters actually deliver. Enhancing the collaborative efficiency of AI infrastructure has therefore become an urgent challenge for the industry. The fundamental solution lies in improving how data flows and is processed across computing, storage, and network systems: essentially, pushing data to move faster and break through existing limitations. To address this, Sugon has launched scaleFabric, its first fully self-developed 400G lossless high-speed network. Combined with technologies such as super tunneling, this forms a tightly coupled architecture integrating storage, computing, and transmission, providing AI clusters with an efficient, secure, and stable data supply.

AI infrastructure is entering a phase of strong collaboration

In recent years, the development trajectory of AI infrastructure has become increasingly clear. The explosion of large AI models quickly spurred the rise of computing power and related infrastructure. After several years of phased construction, AI infrastructure now faces a new critical stage: government work reports have for the first time proposed developing ultra-large-scale intelligent computing clusters as part of new infrastructure. The practical issue the industry now confronts is how to fully utilize the massive computing resources of AI clusters to meet urgent user needs and further advance AI development.

According to Shi Jing, Assistant President of Sugon Information Industry (Beijing) Co., Ltd. and General Manager of the Distributed Storage Division, current AI infrastructure faces core challenges in three areas: computing, storage, and networking. First, as AI clusters continue to expand in scale, amassing vast computing power, computing efficiency has become a bottleneck restricting overall AI development. A report from the China Academy of Information and Communications Technology indicates that computing power demands for large model training roughly double every 3.5 months. This suggests that ten-thousand-GPU-level and even larger AI clusters will become increasingly common, urgently requiring full release of computing efficiency. Second, storage, closely tied to data, must better match computing demands and fully assist in realizing computing efficiency. Third, if computing power is the core of the AI era and data the warehouse, then the network is the circulatory system. As AI clusters grow, the "communication wall" at the network layer is becoming a prominent challenge restricting cluster performance, making network performance a key variable affecting AI cluster efficiency. "Network performance is increasingly critical for AI clusters.
Most newly built clusters are now transitioning to 400G networks," said Zong Ruibo, Product Manager for scaleFabric at Sugon. scaleFabric, China's first native lossless RDMA high-speed network, directly addresses the growing network performance challenges of AI clusters. Designed for ultra-large-scale intelligent computing clusters, it is entirely self-developed by Sugon, from core IP and chips to network cards, switches, drivers, and management software, forming a complete technology stack from hardware to software.

As rapid AI development drives ever-higher performance demands, AI infrastructure increasingly matters as an integrated whole rather than as a set of localized breakthroughs: computing, storage, and networking must form a unified, highly synergistic system. "AI infrastructure is entering a new development stage of tight coupling and strong collaboration. Only this approach maximizes users' return on investment," Shi Jing stated.

Achieving integrated computing, storage, and transmission relies on key technological advancements

If scaleFabric upgrades the data center network from a national highway to a super freeway, then Sugon's distributed storage "super tunneling" technology adds intelligent scheduling to that freeway. It designs dedicated data pathways based on the different I/O types in AI clusters, allowing data to flow rapidly along optimal routes and effectively reducing network congestion and resource contention. This interconnects computing, storage, and networking, enabling integrated and efficient collaboration within AI infrastructure.

In practice, AI places heavy demands on data storage for high performance, high bandwidth, and low latency. Moreover, I/O characteristics vary markedly across AI workloads such as pre-training and inference. For instance, data loading during pre-training requires rapid sequential reading of massive datasets; checkpointing during training involves heavy concurrent read-write operations; and growing inference workloads demand higher random small-I/O throughput. Hence the design philosophy behind "super tunneling" is particularly relevant.

How does Sugon storage implement this technology and deeply integrate it with the self-developed RDMA high-speed network scaleFabric? Shi Jing explained that at the hardware level, super tunneling allocates exclusive RDMA network connections and PCIe channels to each data domain, optimizing resource allocation through NUMA affinity. At the software level, it binds and schedules threads, memory, and storage resources. Through coordinated hardware-software optimization, it achieves optimal pathways for high-speed data flow, integrating computing, storage, and networking to provide consistently stable data load support for AI computing.
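The per-domain binding idea described above can be illustrated with a minimal sketch. This is not Sugon's implementation; the domain names and CPU split below are hypothetical, and real systems would derive CPU sets from the actual NUMA/PCIe topology (e.g. via libnuma) rather than halving the available CPUs. It is Linux-only, since it relies on `sched_setaffinity`.

```python
import os
import threading

# Hypothetical "data domains"; a real deployment would map each domain to
# the NUMA node local to its dedicated NIC and PCIe lanes. Here we simply
# split whatever CPUs this process is allowed to use.
avail = sorted(os.sched_getaffinity(0))
half = max(1, len(avail) // 2)
DOMAIN_CPUS = {
    "pretrain_read": set(avail[:half]),                           # e.g. NUMA node 0
    "checkpoint_write": set(avail[half:]) or set(avail[:half]),   # e.g. NUMA node 1
}

handled = []

def domain_worker(domain: str) -> None:
    # Pin the calling thread to its domain's CPU set (pid 0 = calling
    # thread on Linux), then service only that domain's I/O: each domain
    # keeps its own workers instead of contending in a shared pool.
    os.sched_setaffinity(0, DOMAIN_CPUS[domain])
    handled.append((domain, os.sched_getaffinity(0)))

threads = [threading.Thread(target=domain_worker, args=(d,))
           for d in DOMAIN_CPUS]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point of the sketch is the isolation pattern: because each domain's worker is bound to its own CPU set, a burst of checkpoint writes cannot steal cores from pre-training data loading, mirroring the exclusive-pathway idea behind super tunneling.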
Specifically, super tunneling leverages the high performance and low latency of RDMA networks, using unique virtual network card technology to create multiple virtual mini-cards. This ensures balanced link performance for different data streams while isolating resources to prevent interference and contention.

Traditional approaches pre-allocate resources such as memory for each connection between computing, storage, and networking, a rigid model that becomes problematic as AI clusters scale and applications proliferate, especially as the emergence of AI agents causes explosive growth in inference services. Vast numbers of data connections can quickly exhaust precious infrastructure resources and create performance bottlenecks. Super tunneling instead brings dynamism and intelligence to data transmission: while guaranteeing basic service startup, it flexibly allocates resources per connection, enabling quick initiation and dynamic adjustment of resources such as memory based on traffic changes, keeping data flowing efficiently across the data center.

According to Shi Jing, the effectiveness of super tunneling in integrating storage, computing, and transmission stems from Sugon's long-term strategy of full-stack self-development. Hardware is built from domestic components, and the software is fully owned in source, achieving full autonomy across the infrastructure and software stacks. This provides a solid foundation for efficient synergy and for supporting AI workload demands.
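The contrast between rigid pre-allocation and dynamic per-connection adjustment can be sketched as a small elastic buffer pool. This is an illustrative model only, not Sugon's implementation: the class, quota sizes, and connection names are all invented for the example. Each connection reserves only a small base quota at startup and then grows or shrinks toward its observed traffic, bounded by the shared pool.

```python
class ElasticBufferPool:
    """Toy model of dynamic per-connection resource allocation: connections
    start with a minimal base quota and borrow more from a shared pool as
    traffic grows, returning it when idle, instead of pre-allocating a
    fixed slice per connection (which scales poorly to many connections)."""

    def __init__(self, total_mb: int, base_mb: int = 4):
        self.free_mb = total_mb   # shared pool still unallocated
        self.base_mb = base_mb    # small reservation for quick startup
        self.quota = {}           # connection id -> MB currently held

    def open(self, conn_id: str) -> None:
        # Quick initiation: only the small base quota is reserved up front.
        if self.free_mb < self.base_mb:
            raise MemoryError("pool exhausted")
        self.free_mb -= self.base_mb
        self.quota[conn_id] = self.base_mb

    def adjust(self, conn_id: str, observed_traffic_mb: int) -> int:
        # Grow or shrink toward observed traffic, bounded by what is free.
        want = max(self.base_mb, observed_traffic_mb)
        delta = min(want - self.quota[conn_id], self.free_mb)
        self.free_mb -= delta
        self.quota[conn_id] += delta
        return self.quota[conn_id]

    def close(self, conn_id: str) -> None:
        # Everything the connection held returns to the shared pool.
        self.free_mb += self.quota.pop(conn_id)

pool = ElasticBufferPool(total_mb=64)
pool.open("inference-1")        # holds only the 4 MB base quota
pool.adjust("inference-1", 32)  # grows to 32 MB as traffic ramps up
pool.close("inference-1")       # all 64 MB free again
```

With static pre-allocation, a pool this size would cap out at a fixed number of connections regardless of how idle they are; the elastic version lets many mostly-idle inference connections coexist, which is the scaling problem the passage attributes to agent-driven inference growth.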

Practical validation demonstrates accelerated data performance

Theoretical elegance must withstand the test of real-world AI applications. In February, at a core node of the National Supercomputing Internet, three scaleX ten-thousand-GPU super-clusters commenced trial operation simultaneously, forming China's largest operational domestic AI computing power pool and providing strong validation for Sugon's tightly-coupled, strongly collaborative architecture. Leveraging the zero-threshold deployment advantage of native RDMA networks, the three scaleX clusters achieved application readiness within just 36 hours of switch power-on. Having undergone nearly a year of stability testing while servicing over 100,000 jobs, the super-clusters have proven their performance, scalability, and stability.

In practical applications, combining RDMA high-speed networks with super tunneling has delivered significant efficiency gains: application performance doubled for a meteorological simulation client, and a top domestic research team improved protein research efficiency by three to six orders of magnitude. Furthermore, leading domestic large model vendors have verified the advantages of RDMA networks plus super tunneling on scaleX clusters, achieving high-performance support across the entire AI pipeline, from pre-training data preparation and training checkpoints to inference. "A single storage system supports the customer's full business flow across training and inference scenarios," Shi Jing added.

Undoubtedly, the broad workload compatibility provided by RDMA high-speed networks and super tunneling technology will open wider application prospects for the integrated computing-storage-transmission architecture. Looking ahead, future data centers will evolve into data-centric organisms in which computing, storage, and networking are deeply integrated.
Only by eliminating all obstacles to data flow can the value of computing power be maximized. Prior to scaleFabric's release, China had minimal presence in the high-performance networking sector. Customers faced a difficult choice between high-performing but closed foreign solutions and compatible but higher-latency traditional Ethernet options. Now, with Sugon's distributed storage super tunneling technology tightly integrated with scaleFabric, the final piece has been added to the domestic AI infrastructure puzzle. This completes a full closed-loop of homegrown technology in the AI infrastructure field, propelling China's artificial intelligence industry toward deeper and broader horizons.

