
1. China Mobile (Suzhou) Software Technology Co., Ltd., Suzhou 215123, China
2. China Mobile Communications Group Co., Ltd., Beijing 100032, China
3. China Mobile Group Design Institute Co., Ltd., Beijing 100080, China
[ "娄涛(1978- ),男,中移(苏州)软件技术有限公司副总经理,主要研究方向为云计算、智算中心、网信安全等。" ]
[ "牛红韦华(1989- ),女,中移(苏州)软件技术有限公司计划建设部副总经理,主要研究方向为大规模智算中心、公有云资源池架构、云管理平台等。" ]
[ "张鹏飞(1977- ),男,中国移动通信集团有限公司高级工程师,主要研究方向为IT信创技术、云计算及人工智能技术等。" ]
董江帆(1987- ),男,中国移动通信集团设计院有限公司工程师,主要研究方向为云计算、智算、超算等。
李攀攀(1990- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为云计算、智算中心、人工智能技术等。lipanpan@cmss.chinamobile.com
李道通(1990- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为公有云资源池、智算资源池方案等。
许伟栋(1993- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为人工智能算法、大模型技术等。
姚成辉(1994- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为人工智能算法、大模型技术等。
薛连浩(1997- ),男,现就职于中移(苏州)软件技术有限公司,主要研究方向为人工智能算法、大模型技术等。
唐婷(1997- ),女,现就职于中移(苏州)软件技术有限公司,主要研究方向为人工智能算法、大模型技术等。
向洁(1990- ),女,现就职于中移(苏州)软件技术有限公司,主要研究方向为大数据、智算中心等。
Received: 2025-03-21
Revised: 2025-06-11
Published in print: 2025-07-20
LOU Tao, NIU Hongweihua, ZHANG Pengfei, et al. Practice of large language model training optimization based on large-scale AI cluster with more than 10 000 domestic NPU[J]. Telecommunications Science, 2025, 41(7): 122-132. DOI: 10.11959/j.issn.1000-0801.2025166.
To address the problems of low compute utilization, poor stability, high optimization difficulty, and an immature domestic technology ecosystem in large model training on AI clusters with more than 10 000 NPUs, a large language model training optimization solution based on a fully domestic AI cluster of this scale was proposed. Through automatic distributed-strategy recommendation, pipeline parallelism optimization, overlap optimization, and full-link profiling, a 405B-parameter large language model was pre-trained on 16 384 domestic NPU accelerator cards, and the model FLOPS utilization (MFU) reached 45.13%, an improvement of more than 10% over the baseline performance. In addition, a stability assurance mechanism was built across the entire training workflow, enabling real-time monitoring of key indicators before and during training as well as second-level fault diagnosis of training tasks. Experimental results show that the proposed training solution can effectively improve compute utilization and provides important guidance for the future construction of domestic AI clusters and for large language model training.
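As background for the MFU figure reported above, the following is a minimal, illustrative sketch (not code from the paper) of how model FLOPS utilization is commonly estimated for dense transformer pre-training, using the widely used approximation of roughly 6 FLOPs per parameter per token. The cluster-wide throughput and per-card peak FLOPS values below are placeholders for demonstration, not measurements or hardware specifications from this work.

# Illustrative sketch (not from the paper): estimating model FLOPS utilization
# (MFU) for dense transformer pre-training, using the common approximation of
# about 6 FLOPs per parameter per token for one forward + backward pass.

def estimate_mfu(num_params, tokens_per_second, num_accelerators, peak_flops_per_accelerator):
    """Return MFU as a fraction in [0, 1]."""
    achieved_flops = 6.0 * num_params * tokens_per_second    # approximate training FLOPs sustained per second
    peak_flops = num_accelerators * peak_flops_per_accelerator
    return achieved_flops / peak_flops

if __name__ == "__main__":
    mfu = estimate_mfu(
        num_params=405e9,                   # 405B-parameter model, as reported in the abstract
        tokens_per_second=1.2e6,            # placeholder cluster-wide training throughput
        num_accelerators=16_384,            # cluster size reported in the abstract
        peak_flops_per_accelerator=400e12,  # placeholder peak FLOPS per NPU card (assumed, not a real spec)
    )
    print(f"Estimated MFU: {mfu:.2%}")      # roughly 44.5% with these placeholder inputs

With these placeholder inputs the sketch yields an MFU of roughly 44.5%, in the same range as the 45.13% reported above; the figure achieved in practice depends on the real cluster throughput and the NPU's actual peak FLOPS, neither of which is assumed here.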