China Mobile Research Institute, Beijing 100053, China
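DUAN Xiaodong (1977- ), male, is vice president of China Mobile Research Institute, a national-level candidate of the "New Century Hundred-Thousand-Ten Thousand Talents Project", and a professor-level senior engineer. His main research interests include next-generation Internet, computing power networks, 5G network architecture, 6G network architecture, and SDN/NFV.
LI Jieyu (1994- ), female, Ph.D., works at China Mobile Research Institute. Her main research interest is data center networks.
CHENG Weiqiang (1980- ), male, is deputy director of the Institute of Basic Network Technology at China Mobile Research Institute and a professor-level senior engineer. His main research interests include next-generation Internet, data center networks, and transport networks.
LI Han (1975- ), male, Ph.D., is director of the Institute of Basic Network Technology at China Mobile Research Institute and a professor-level senior engineer. His main research interests include optical communication and bearer networks.
WANG Ruixue (1990- ), female, works at China Mobile Research Institute. Her main research interests include data center networks, SDN/NFV, and computing power networks.
WANG Haojie (1992- ), male, Ph.D., works at China Mobile Research Institute. His main research interest is high-speed Ethernet.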
Received: 2024-04-01,
Revised: 2024-06-13,
Published in print: 2024-06-20
DUAN Xiaodong, LI Jieyu, CHENG Weiqiang, et al. Challenges and key technologies of new Ethernet for intelligent computing center[J]. Telecommunications Science, 2024, 40(06): 146-159. DOI: 10.11959/j.issn.1000-0801.2024171.
AI large models are driving the next decade of growth in the information and communications technology (ICT) industry. The intelligent computing center network is the communication foundation that supports distributed training of AI large models, and it is one of the key factors determining the efficiency of AI clusters. The data volume and parameter counts of AI large models keep growing, which poses serious challenges to intelligent computing center networks and also creates an opportunity for generational innovation in key network technologies. During AI large model training and inference, high-performance and high-security data transmission are the two core requirements that AI workloads place on the intelligent computing center network. Efficient load balancing, congestion control, and network security protocols are the key network technologies involved. To address the challenges brought by large-scale AI workloads, global scheduling Ethernet (GSE) is proposed as a solution, and a real test environment is built to compare the performance of GSE and RoCE (remote direct memory access over converged Ethernet) networks. The test results show that GSE significantly improves job completion time (JCT) compared with the RoCE network.