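1. Cloud Computing and Big Data Research Institute, China Academy of Information and Communications Technology, Beijing 100191
2. School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing 100044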
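GUO Liang (1980- ), male, chief engineer and professor-level senior engineer at the Cloud Computing and Big Data Research Institute, China Academy of Information and Communications Technology, and doctoral candidate at the School of Electronic and Information Engineering, Beijing Jiaotong University. His work focuses on policy support, technical research, and standards development related to computing power.
WANG Shaopeng (1992- ), male, deputy director of the Data Center Department and engineer at the Cloud Computing and Big Data Research Institute, China Academy of Information and Communications Technology. His work focuses on industry consulting, technical research, and standards development related to the convergence of computing and networking.
QUAN Wei (1987- ), male, Ph.D., professor and doctoral supervisor at the School of Electronic and Information Engineering, Beijing Jiaotong University. His research focuses on new network architectures and key technologies for highly reliable network transmission.
LI Jie (1979- ), female, Ph.D., deputy director and professor-level senior engineer at the Cloud Computing and Big Data Research Institute, China Academy of Information and Communications Technology, and vice chair of the Open Data Center Committee. Her work focuses on industry, policy, technology, and standards research related to computing power.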
Received: 2024-04-03
Revised: 2024-05-14
Print publication date: 2024-06-20
GUO Liang,WANG Shaopeng,QUAN Wei,et al.Research on the development of intelligent computing network for large models[J].Telecommunications Science,2024,40(06):137-145. DOI: 10.11959/j.issn.1000-0801.2024147.
In recent years, the world has entered a period of rapid growth in intelligent computing. Large models are deep learning models with massive parameter counts and complex structures, and training them requires fast synchronization of training parameters across multiple cards and servers. This places higher demands on the bandwidth, latency, reliability, scalability, and security of data center networks. The requirements and related key technologies of intelligent computing networks for large model training were studied, and the research results, standard specifications, and case practices of intelligent computing networks were analyzed, in order to further promote the development of intelligent computing networks.