
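1. Research Institute of China Telecom Co., Ltd., Beijing 102209
2. State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications), Beijing 100876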
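SU Yuzhen (苏昱臻, 1998- ), male, is an engineer at the Research Institute of China Telecom Co., Ltd. His main research interests are data center RDMA networks and heterogeneous computing.
WANG Zixiao (王子潇, 1995- ), male, is an engineer at the Research Institute of China Telecom Co., Ltd. His main research interests are high-performance RDMA networks and heterogeneous computing.
ZHONG Chiliang (钟驰量, 2001- ), male, is a master's student at Beijing University of Posts and Telecommunications. His main research interests are programmable networks and RDMA.
KOU Xiaohuai (寇晓淮, 1989- ), male, is an engineer at the Research Institute of China Telecom Co., Ltd. His main research interest is data center networks.
LIU Yuan (刘圆, 1979- ), male, Ph.D., is an engineer at the Research Institute of China Telecom Co., Ltd. His main research interests are intelligent computing infrastructure and large language models.
CHEN Ying (陈映, 1993- ), male, is an engineer at the Research Institute of China Telecom Co., Ltd. His main research interest is intelligent computing infrastructure.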
Received: 2025-06-09
Revised: 2025-07-19
Accepted: 2025-07-15
Published in print: 2025-08-20
SU Yuzhen, WANG Zixiao, ZHONG Chiliang, et al. Research on key technologies of intelligent computing network for heterogeneous computing power interconnection[J]. Telecommunications Science, 2025, 41(08): 51-64. DOI: 10.11959/j.issn.1000-0801.2025183.
The generational heterogeneity of computing supply and the need for supply chain security have made heterogeneous computing an emerging trend in AI infrastructure. However, in heterogeneous hybrid training scenarios, the RoCEv2 (RDMA over converged Ethernet version 2) solution suffers from deficiencies in load balancing and congestion control, resulting in suboptimal performance for the parallel communication of model training. Meanwhile, existing high-performance homogeneous intelligent computing network solutions are difficult to deploy because devices are heterogeneous and CCLs (collective communication libraries) are closed-source. To address these challenges, ICE (intelligent control Ethernet), a high-performance intelligent computing network solution for heterogeneous computing scenarios, is proposed. Built on the RoCEv2 protocol framework, ICE avoids deep customization of devices and CCLs. It combines information collection from heterogeneous communication libraries, a centralized controller, and autonomous end-side control to achieve globally optimal path planning and global active congestion control, significantly improving heterogeneous parallel communication performance. Experiments in a real physical environment show that ICE improves collective communication performance by up to 47%. ICE thus offers a pioneering, easily deployable solution for building heterogeneous intelligent computing networks.