1. China Mobile (Suzhou) Software Technology Co., Ltd., Suzhou 215123, China
2. China Mobile Group Design Institute Co., Ltd., Beijing 100080, China
3. Anhui Branch of China Mobile Group Design Institute Co., Ltd., Hefei 230041, China
[ "李攀攀(1990- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为云计算、智算中心、人工智能技术等。" ]
[ "牛红韦华(1989- ),女,中移(苏州)软件技术有限公司计划建设部副总经理,主要研究方向为大规模智算中心、公有云资源池架构、云管理平台等。" ]
[ "赵万龙(1994- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为智算基础设施、人工智能技术等。" ]
[ "马华伟(1976- ),男,中国移动通信集团设计院有限公司高级工程师,主要研究方向为云计算、人工智能和数据业务网等。" ]
王艳辉(1989- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为智算基础设施、人工智能技术等。
江伟(1980- ),男,中国移动通信集团设计院有限公司高级工程师,主要研究方向为云计算、IT网络等。
张雯欣(1998- ),女,现就职于中移(苏州)软件技术有限公司,主要研究方向为智算基础设施、人工智能技术等。
陆一鸣(1998- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为智算基础设施、人工智能技术等。
赵峰(1991- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为智算基础设施、人工智能技术。
Received: 2025-03-21
Revised: 2025-07-02
Accepted: 2025-06-12
Published in print: 2025-07-20
LI Panpan, NIU Hongweihua, ZHAO Wanlong, et al. Research on multi-heterogeneous hybrid training system for AI computing power scenarios[J]. Telecommunications Science, 2025, 41(7): 133-144. DOI: 10.11959/j.issn.1000-0801.2025164.
Large language model training is a pivotal scenario in artificial intelligence (AI) development. Under the trend of diversified and heterogeneous computing power, cross-ecosystem collaboration of heterogeneous computing resources will become the key support for training at the hundred-thousand-card scale. Against this background, a heterogeneous AI computing power hybrid training system was designed that can automatically detect and adapt to heterogeneous AI chips, enabling collective communication among heterogeneous computing resources. Based on this prototype system, hybrid training with multiple combinations of heterogeneous accelerators was implemented on a RoCEv2-interconnected cluster composed of three types of AI chips. In the heterogeneous pipeline parallelism (PP) hybrid training scenario, the optimal hybrid training efficiency reached 99.77% with NVIDIA and Biren GPUs, and 99.03% with NVIDIA, Iluvatar, and Biren GPUs. In the heterogeneous data parallelism (DP) hybrid training scenario, the optimal hybrid training efficiency between NVIDIA and Biren GPUs reached 92.88%.
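The abstract does not describe how work is balanced across chip types in the heterogeneous PP scenario, or how hybrid training efficiency is defined. The Python sketch below is a minimal illustration under stated assumptions, not the authors' implementation: pipeline stages are sized in proportion to each accelerator's measured per-layer throughput so that stage times roughly match, and efficiency is taken as the ratio of achieved hybrid throughput to the sum of the corresponding homogeneous throughputs. The `Accelerator` class, the throughput figures, and this efficiency definition are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    layers_per_second: float  # assumed single-device throughput measurement


def partition_layers(total_layers: int, stages: list[Accelerator]) -> list[int]:
    """Assign model layers to pipeline stages in proportion to throughput,
    so each stage takes roughly the same time per micro-batch."""
    total_speed = sum(a.layers_per_second for a in stages)
    raw = [total_layers * a.layers_per_second / total_speed for a in stages]
    counts = [max(1, round(x)) for x in raw]
    # Absorb rounding drift into the last stage so counts sum exactly.
    counts[-1] += total_layers - sum(counts)
    return counts


def hybrid_efficiency(actual_tokens_per_s: float,
                      stages: list[Accelerator],
                      ideal_tokens_per_s: dict[str, float]) -> float:
    """One possible definition: achieved hybrid throughput divided by the sum
    of throughputs each accelerator reaches in a homogeneous run (assumed)."""
    ideal_total = sum(ideal_tokens_per_s[a.name] for a in stages)
    return actual_tokens_per_s / ideal_total


if __name__ == "__main__":
    # Hypothetical two-stage NVIDIA + Biren pipeline over a 32-layer model.
    stages = [Accelerator("nvidia", 3.0), Accelerator("biren", 1.0)]
    print(partition_layers(32, stages))  # -> [24, 8]
    print(hybrid_efficiency(950.0, stages,
                            {"nvidia": 760.0, "biren": 260.0}))  # ~0.93
```

In this sketch, a faster accelerator simply hosts more layers; real systems must also account for memory capacity, activation-transfer cost between stages, and cross-vendor collective-communication overhead, which the paper's system addresses.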