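1. China Mobile (Suzhou) Software Technology Co., Ltd., Suzhou 215100, Jiangsu, China
2. China Mobile Group Design Institute Co., Ltd., Beijing 100080, China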
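NIU Hongweihua (1989- ), female, deputy general manager of the Planning Department at China Mobile (Suzhou) Software Technology Co., Ltd. Her main research interests include cloud computing and artificial intelligence.
HUANG Yongbao (1990- ), male, works at China Mobile (Suzhou) Software Technology Co., Ltd. His main research interests include cloud computing and artificial intelligence.
DING Guoqiang (1986- ), male, works at China Mobile Group Design Institute Co., Ltd. His main research interests include cloud computing and artificial intelligence technologies.
HUANG Bao (1992- ), male, works at China Mobile (Suzhou) Software Technology Co., Ltd. His main research interests include cloud computing and artificial intelligence.
ZHAO Zhiwen (1990- ), male, works at China Mobile (Suzhou) Software Technology Co., Ltd. His main research interests include cloud computing and artificial intelligence.
XU Yang (1991- ), male, works at China Mobile (Suzhou) Software Technology Co., Ltd. His main research interests include artificial intelligence and fault diagnosis.
WANG Tao (1991- ), male, works at China Mobile (Suzhou) Software Technology Co., Ltd. His main research interests include wireless security and intelligent computing centers.
ZHANG Ruiling (1988- ), male, works at China Mobile (Suzhou) Software Technology Co., Ltd. His main research interests include cloud computing and artificial intelligence.
WANG Xuan (1995- ), male, works at China Mobile (Suzhou) Software Technology Co., Ltd. His main research interests include cloud computing and artificial intelligence.
ZHANG Yixiang (1996- ), male, works at China Mobile (Suzhou) Software Technology Co., Ltd. His main research interests include network security and computing power networks.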
Received: 2025-03-12
Revised: 2025-06-06
Published in print: 2025-07-20
NIU Hongweihua, HUANG Yongbao, DING Guoqiang, et al. A data and knowledge-driven practice for ensuring stability in ultra-large intelligent computing clusters[J]. Telecommunications Science, 2025, 41(07): 145-163. DOI: 10.11959/j.issn.1000-0801.2025151.
To address stability assurance problems in intelligent computing clusters with more than ten thousand computing cards, including frequent hardware failures, persistently high training-task failure rates, and difficult cross-domain fault localization, a data and knowledge-driven stability assurance scheme is proposed. First, cluster performance data are collected using integrated heterogeneous-resource collection technology and distributed real-time big-data extract-transform-load (ETL) technology. Next, fault diagnosis is performed with an improved self-attention-based bidirectional long short-term memory (SA-BiLSTM) deep learning model. Finally, the output of the diagnostic model is analyzed and matched against a knowledge graph to generate a fault diagnosis report, which improves the explainability of the model's output. When the deep learning model extracts time-series features, feature weight coefficients are introduced so that features extracted at different scales are fused by weighting, which improves the accuracy of fault diagnosis. In fault diagnosis simulation experiments based on an 18 000-card intelligent computing cluster, the loss value gradually converged and stabilized at 0.047, and the accuracy reached 98.4%. Practice shows that the proposed scheme can effectively support large-model training and improve the reliability of intelligent computing clusters, providing a solid foundation for building larger-scale clusters and training large models in the future.
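The weighted fusion of features extracted at different scales can be illustrated with a small sketch. This is not the authors' SA-BiLSTM model: a trailing moving average stands in for a learned feature extractor at each time scale, and the weight coefficients are obtained by applying a softmax to hand-picked scores, whereas in the paper they would presumably be learned. The function names `moving_average` and `fuse_multiscale` and all parameters are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moving_average(series, window):
    """Trailing moving average with the same length as the input.

    Assumption: this stands in for a learned extractor at one time scale.
    """
    out = []
    for i in range(len(series)):
        start = max(0, i - window + 1)
        chunk = series[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def fuse_multiscale(series, windows, scale_scores):
    """Fuse per-scale features using softmax-normalized weight coefficients."""
    if len(windows) != len(scale_scores):
        raise ValueError("need exactly one score per window size")
    weights = softmax(scale_scores)
    features = [moving_average(series, w) for w in windows]
    fused = [
        sum(wt * feat[t] for wt, feat in zip(weights, features))
        for t in range(len(series))
    ]
    return fused, weights


# Hypothetical metric trace with a spike at the last time step.
fused, weights = fuse_multiscale([1.0, 1.0, 1.0, 4.0], windows=[1, 2], scale_scores=[0.0, 0.0])
print(weights, fused)
```

With equal scores the two scales contribute equally; raising the score of the short window would make the fused signal follow the spike more closely, which is the intuition behind weighting scales differently.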