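1. Zhejiang Lab, Hangzhou, Zhejiang 311121
2. College of Computer, National University of Defense Technology, Changsha, Hunan 410073
3. School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083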
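ZHANG Huifeng (张慧峰, 1992- ), female, Ph.D., senior engineer at Zhejiang Lab. Her main research interests include novel network architectures, network topologies, and high-performance networking technologies.
LIU Ningchun (刘宁春, 1994- ), male, Ph.D., assistant researcher at Zhejiang Lab. His main research interests include novel network architectures, and fault diagnosis and health management.
LONG Weiping (龙卫平, 1988- ), male, senior engineer at Zhejiang Lab. His main research interest is high-performance networks.
LU Pingjing (陆平静, 1984- ), female, Ph.D., associate researcher at the College of Computer, National University of Defense Technology. Her main research interests include high-performance computing, high-performance interconnection networks, and data center networks.
ZOU Tao (邹涛, 1974- ), male, Ph.D., research expert and researcher at Zhejiang Lab. His main research interests include novel network architectures and the design and optimization of network protocols.
LONG Keping (隆克平, 1968- ), male, Ph.D., professor and doctoral supervisor at the School of Computer and Communication Engineering, University of Science and Technology Beijing. His main research interests include next-generation network technologies, key technologies for the optical Internet, wireless communication technologies, and artificial intelligence and big data.
ZHANG Ruyun (张汝云, 1973- ), Ph.D., research expert at Zhejiang Lab. Main research interests include the industrial Internet and network communication security.
ZHU Jun (朱俊, 1981- ), male, Ph.D., engineering expert and senior engineer at Zhejiang Lab. His main research interests include novel network architectures, software-defined networking, network resource management, and the design and optimization of network protocols.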
Received: 2025-06-04
Revised: 2025-07-10
Accepted: 2025-08-06
Published in print: 2025-12-20
ZHANG Huifeng, LIU Ningchun, LONG Weiping, et al. Hyperscale intelligent computing cluster networks for domestic computing power: critical challenges, technical pathways, and future trends[J]. Telecommunications Science, 2025, 41(12): 1-26. DOI: 10.11959/j.issn.1000-0801.2025229.
With the rapid advancement of artificial intelligence technologies such as large models, constructing hyperscale intelligent computing cluster networks has become an imperative. However, China faces three core difficulties in building such infrastructure: shortages of NVIDIA GPUs, the high cost of computing resources, and low utilization of those resources. Given that domestic computing power is not yet as mature as NVIDIA's products and technical ecosystem, three key technical challenges that intelligent computing cluster networks face after adopting domestic computing power are systematically analyzed: improving the interconnection capability of domestic intelligent computing cluster networks; improving the transmission efficiency of intelligent computing cluster networks; and strengthening the availability of intelligent computing cluster networks. To address these challenges, existing technical approaches and solutions are examined in depth, spanning network architecture, network devices, communication protocols, and network faults. Drawing on practical cluster deployment experience, future development trends toward autonomous, controllable, efficient, and reliable intelligent computing cluster network infrastructure are proposed, providing theoretical support and practical reference for building large-scale domestic intelligent computing clusters.