Research Institute of China Telecom Co., Ltd., Beijing 102209
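WANG Xuecong (1989- ), male, is an engineer at the Research Institute of China Telecom Co., Ltd. His main research interests include artificial intelligence, cloud computing, and intelligent computing networks.
JI Siwei (1996- ), female, is an engineer at the Research Institute of China Telecom Co., Ltd. Her main research interests include cloud computing, computing power networks, and intelligent computing networks.
LI Cong (1993- ), female, is deputy director of the Future Network Research Center at the Research Institute of China Telecom Co., Ltd. Her main research interests include future network technologies, next-generation Internet technologies, and data center networks.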
Received: 2024-04-03
Revised: 2024-06-14
Published in print: 2024-06-20
WANG Xuecong, JI Siwei, LI Cong. Research on intelligent computing network technology for large-scale pre-trained models[J]. Telecommunications Science, 2024, 40(06): 160-172. DOI: 10.11959/j.issn.1000-0801.2024167.
With the development of artificial intelligence, large-scale pre-trained models have achieved significant results in fields such as natural language processing and computer vision, which has promoted the construction of intelligent computing centers. This paper studies the key technologies of intelligent computing networks for large-model pre-training. It systematically reviews the latest standardization progress on intelligent computing networks in China and abroad, proposes a target architecture for intelligent computing networks, and examines the principles of the key technologies, including remote direct memory access (RDMA), InfiniBand (IB), RDMA over Converged Ethernet (RoCE), and collective communication. It also analyzes the current problems and future development trends of intelligent computing networks. The work is significant for advancing intelligent computing network technology and guiding the construction of intelligent computing centers.