1. Research Institute of China United Network Communications Co., Ltd., Beijing 100048, China
2. Beijing University of Posts and Telecommunications, Beijing 100876, China
3. Beijing Jiliu Technology Co., Ltd. (北京基流科技有限公司), Beijing 100083, China
ZHANG Dongyue (1995- ), female, engineer at the Research Institute of China United Network Communications Co., Ltd. Her main research interest is intelligent computing center networks.
XIN Qi (2000- ), male, Ph.D. candidate at Beijing University of Posts and Telecommunications. His main research interests include high-performance computing networks and computing power networks.
HAN Bowen (1993- ), male, engineer at the Research Institute of China United Network Communications Co., Ltd. His main research interests include IP bearer networks, new data center networks, and network management and operations technology.
XU Bohua (1989- ), male, senior engineer at the Research Institute of China United Network Communications Co., Ltd. His main research interests include new IP technologies and the development of new network equipment.
CAO Chang (1984- ), male, Ph.D., professor-level senior engineer and department director at the Research Institute of China United Network Communications Co., Ltd. His main research interests include computing power networks and the next-generation Internet.
GU Ruya (1990- ), female, currently with Beijing Jiliu Technology Co., Ltd. (北京基流科技有限公司). Her main research interests include AI software stacks and system architecture.
Received: 2025-03-25; Revised: 2025-04-12; Accepted: 2025-05-14; Published in print: 2025-08-20
Citation: ZHANG Dongyue, XIN Qi, HAN Bowen, et al. Sub-millisecond-level network monitoring system for intelligent computing centers network[J]. Telecommunications Science, 2025, 41(08): 65-75. DOI: 10.11959/j.issn.1000-0801.2025172.
As intelligent computing centers become core infrastructure supporting the high-quality development of the digital economy, the training of hundred-billion-parameter large models imposes stringent requirements on network performance. Traditional monitoring methods, limited by insufficient sampling precision and a lack of fine-grained observation, struggle to locate communication bottlenecks in ten-thousand-GPU clusters. A sub-millisecond-level network monitoring system (sMon) was proposed, which integrates intelligent counters and a dynamic bandwidth analysis module into the work request processing pipeline to track NIC port queue depth and traffic fluctuations in real time. Bandwidth throughput is computed dynamically with a sliding-window algorithm, maintaining sub-millisecond temporal accuracy while an asynchronous log collection mechanism keeps system overhead at a central processing unit (CPU) utilization of only 0.8%. In tests on a 128-node A100 cluster, the system successfully captured sub-millisecond traffic details at network ports, and experimental results show a two-order-of-magnitude improvement in data granularity over conventional monitoring solutions. Through high-precision network state perception, the proposed system provides real-time monitoring and performance assurance for building "ultra-large-scale, ultra-high-bandwidth, ultra-reliable" intelligent computing center networks, effectively supporting large-scale artificial intelligence (AI) training tasks.
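The abstract describes computing bandwidth throughput over a sliding window of sub-millisecond duration. The sketch below is a minimal, hypothetical illustration of that idea in Python; the class name, the 500 µs window, and the sampling cadence are assumptions for demonstration and do not reflect sMon's actual implementation.

```python
from collections import deque

class SlidingWindowBandwidth:
    """Hypothetical sliding-window throughput estimator (illustrative only,
    not the paper's sMon implementation)."""

    def __init__(self, window_us=500):
        self.window_us = window_us   # sub-millisecond window, e.g. 500 microseconds
        self.samples = deque()       # (timestamp_us, cumulative_bytes) pairs

    def record(self, timestamp_us, cumulative_bytes):
        """Add one counter sample and evict samples older than the window."""
        self.samples.append((timestamp_us, cumulative_bytes))
        while self.samples and timestamp_us - self.samples[0][0] > self.window_us:
            self.samples.popleft()

    def throughput_gbps(self):
        """Average throughput over the current window, in Gbit/s."""
        if len(self.samples) < 2:
            return 0.0
        t0, b0 = self.samples[0]
        t1, b1 = self.samples[-1]
        if t1 == t0:
            return 0.0
        # (bytes * 8) bits over (t1 - t0) microseconds; bits/us / 1e3 = Gbit/s
        return (b1 - b0) * 8 / (t1 - t0) / 1e3

bw = SlidingWindowBandwidth(window_us=500)
for t in range(0, 1000, 100):       # one sample every 100 us
    bw.record(t, t * 12500)         # 12 500 bytes/us corresponds to 100 Gbit/s
print(bw.throughput_gbps())         # -> 100.0
```

Reading cumulative hardware counters and differencing them inside a bounded window keeps the per-sample cost to a deque append and a few evictions, which is consistent with the low CPU overhead the paper targets.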