1. China Mobile Communications Group Co., Ltd., Beijing 100032, China
2. China Mobile (Suzhou) Software Technology Co., Ltd., Suzhou 215123, China
3. China Mobile Group Zhejiang Co., Ltd., Hangzhou 311103, China
4. China Mobile Group Design Institute Co., Ltd., Beijing 100080, China
[ "丁宏庆(1972- ),男,中国移动通信集团有限公司计划建设部副总经理,主要研究方向为算力网络、云计算及人工智能技术等。" ]
[ "张鹏飞(1977- ),男,中国移动通信集团有限公司高级工程师,主要研究方向为IT信创技术、云计算及人工智能技术等。" ]
[ "牛红韦华(1989- ),女,中移(苏州)软件技术有限公司计划建设部总经理助理,主要研究方向为大规模智算中心、公有云资源池架构、云管理平台等。" ]
[ "李志勇(1986- ),男,中国移动通信集团浙江有限公司高级工程师,主要研究方向为云计算和人工智能技术等。" ]
[ "周丹媛(1990- ),女,中国移动通信集团有限公司高级项目经理,主要研究方向为云计算和人工智能技术等。" ]
[ "丁国强(1986- ),男,中国移动通信集团设计院有限公司高级研究专员,主要研究方向为云计算和人工智能技术等。" ]
[ "李攀攀(1990- ),男,中移(苏州)软件技术有限公司软件架构师,主要研究方向云计算、人工智能技术等。" ]
[ "李道通(1990- ),男,中移(苏州)软件技术有限公司解决方案经理,主要研究方向为公有云资源池、智算资源池方案等。" ]
[ "张久仙(1988- ),男,中移(苏州)软件技术有限公司解决方案架构师,主要研究方向云计算、高性能网络、人工智能技术等。" ]
Received: 2024-11-07; Revised: 2024-12-02; Published in print: 2024-12-20
DING Hongqing, ZHANG Pengfei, NIU Hongweihua, et al. Cloud-based intelligent computing center ten-thousand card cluster innovation and practice[J]. Telecommunications Science, 2024, 40(12): 125-135. DOI: 10.11959/j.issn.1000-0801.2024262.
To address issues such as low availability of computing power in ultra-large-scale computing clusters of intelligent computing centers, low maturity of domestically produced technologies, bottlenecks in large-scale networking efficiency, and complex operations and maintenance, a system based on cloud computing technology for constructing a ten-thousand-card cluster in an intelligent computing center was proposed. The cluster was built from 18 432 neural processing unit (NPU) cards and an optimized RDMA-over-Ethernet network. A multi-plane network architecture was adopted, in conjunction with software defined network (SDN) technology, to achieve RDMA network tenant isolation. The network load balancing strategy was optimized, resulting in a link load balancing error of less than 10% and an All-Reduce bandwidth of over 35 GB/s. By employing an optimized distributed storage protocol, the model checkpoint recovery time was reduced to half of its original duration. The validation results demonstrate that, with collaborative optimization of software and hardware, the domestically produced NPU ten-thousand-card cluster can not only meet the training needs of large models with hundreds of billions of parameters but also support training tasks for large models with trillions of parameters.
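All-Reduce bandwidth figures like the 35 GB/s reported above are conventionally quoted as "bus bandwidth" in the NCCL style, which scales the raw payload/time ratio by 2(n-1)/n to reflect the traffic a ring all-reduce actually moves per rank. The sketch below illustrates that convention only; the function name and the sample payload, time, and rank values are illustrative assumptions, not measurements from the paper.

```python
def allreduce_bus_bandwidth(payload_bytes: int, elapsed_s: float, n_ranks: int) -> float:
    """Estimate all-reduce bus bandwidth in GB/s (NCCL convention).

    Algorithm bandwidth is payload / time; bus bandwidth scales it by
    2*(n-1)/n, the bytes each rank sends and receives in a ring all-reduce
    per byte of payload.
    """
    algo_bw = payload_bytes / elapsed_s            # bytes per second, end to end
    bus_bw = algo_bw * 2 * (n_ranks - 1) / n_ranks # account for ring traffic
    return bus_bw / 1e9                            # convert to GB/s

# Illustrative numbers only: an 8 GiB payload reduced across 16 ranks in 0.45 s
print(f"{allreduce_bus_bandwidth(8 * 2**30, 0.45, 16):.1f} GB/s")
```

With these made-up inputs the formula yields roughly 35.8 GB/s, showing the scale of measurement behind a figure like the one in the abstract.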