1. China Mobile (Suzhou) Software Technology Co., Ltd., Suzhou 215123, China
2. China Mobile Group Design Institute Co., Ltd., Beijing 100080, China
3. Anhui Branch of China Mobile Group Design Institute Co., Ltd., Hefei 230041, China
[ "李攀攀(1990- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为云计算、智算中心、人工智能技术等。" ]
[ "牛红韦华(1989- ),女,中移(苏州)软件技术有限公司计划建设部副总经理,主要研究方向为大规模智算中心、公有云资源池架构、云管理平台等。" ]
[ "赵万龙(1994- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为智算基础设施、人工智能技术等。" ]
[ "马华伟(1976- ),男,中国移动通信集团设计院有限公司高级工程师,主要研究方向为云计算、人工智能和数据业务网等。" ]
王艳辉(1989- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为智算基础设施、人工智能技术等。
江伟(1980- ),男,中国移动通信集团设计院有限公司高级工程师,主要研究方向为云计算、IT网络等。
张雯欣(1998- ),女,现就职于中移(苏州)软件技术有限公司,主要研究方向为智算基础设施、人工智能技术等。
陆一鸣(1998- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为智算基础设施、人工智能技术等。
赵峰(1991- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为智算基础设施、人工智能技术。
Received: 2025-03-21
Revised: 2025-07-02
Accepted: 2025-06-12
Published in print: 2025-07-20
LI Panpan, NIU Hongweihua, ZHAO Wanlong, et al. Research on multi-heterogeneous hybrid training system for AI computing power scenarios[J]. Telecommunications Science, 2025, 41(7): 133-144. DOI: 10.11959/j.issn.1000-0801.2025164.
Large language model training is a pivotal scenario in artificial intelligence (AI) development. Under the trend of diversified and heterogeneous computing power, cross-ecosystem collaboration of heterogeneous computing resources will become the key support for training at the hundred-thousand-card scale. Against this background, a heterogeneous AI computing power hybrid training system was designed that can automatically detect and adapt to heterogeneous AI chips, enabling collective communication among heterogeneous computing resources. Based on this prototype system, hybrid training with multiple combinations of heterogeneous accelerators was implemented on a RoCEv2-interconnected cluster composed of three types of AI chips. In the heterogeneous pipeline parallelism (PP) hybrid training scenario, the optimal hybrid training efficiency reached 99.77% with NVIDIA and Biren GPUs, and 99.03% with NVIDIA, Iluvatar, and Biren GPUs. In the heterogeneous data parallelism (DP) hybrid training scenario, the optimal hybrid training efficiency between NVIDIA and Biren GPUs reached 92.88%.
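The abstract does not describe how work is balanced across chip types in the heterogeneous PP scenario, or how hybrid training efficiency is defined. The Python sketch below is a minimal illustration under stated assumptions, not the authors' implementation: pipeline stages are sized in proportion to each accelerator's measured per-layer throughput so that stage times roughly match, and efficiency is taken as the ratio of achieved hybrid throughput to the sum of the corresponding homogeneous throughputs. The `Accelerator` class, the throughput figures, and this efficiency definition are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    layers_per_second: float  # assumed single-device throughput measurement


def partition_layers(total_layers: int, stages: list[Accelerator]) -> list[int]:
    """Assign model layers to pipeline stages in proportion to throughput,
    so each stage takes roughly the same time per micro-batch."""
    total_speed = sum(a.layers_per_second for a in stages)
    raw = [total_layers * a.layers_per_second / total_speed for a in stages]
    counts = [max(1, round(x)) for x in raw]
    # Absorb rounding drift into the last stage so counts sum exactly.
    counts[-1] += total_layers - sum(counts)
    return counts


def hybrid_efficiency(actual_tokens_per_s: float,
                      stages: list[Accelerator],
                      ideal_tokens_per_s: dict[str, float]) -> float:
    """One possible definition: achieved hybrid throughput divided by the sum
    of throughputs each accelerator reaches in a homogeneous run (assumed)."""
    ideal_total = sum(ideal_tokens_per_s[a.name] for a in stages)
    return actual_tokens_per_s / ideal_total


if __name__ == "__main__":
    # Hypothetical two-stage NVIDIA + Biren pipeline over a 32-layer model.
    stages = [Accelerator("nvidia", 3.0), Accelerator("biren", 1.0)]
    print(partition_layers(32, stages))  # -> [24, 8]
    print(hybrid_efficiency(950.0, stages,
                            {"nvidia": 760.0, "biren": 260.0}))  # ~0.93
```

In this sketch, a faster accelerator simply hosts more layers; real systems must also account for memory capacity, activation-transfer cost between stages, and cross-vendor collective-communication overhead, which the paper's system addresses.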