1. Research Institute of China Telecom Co., Ltd., Guangzhou 510630, Guangdong, China
2. Shanghai Branch of China Telecom Co., Ltd., Shanghai 200120, China
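GAO Xiang (高翔, 1991- ), male, engineer at the Research Institute of China Telecom Co., Ltd. His main research interests include computer architecture, compute-in-memory, heterogeneous computing, and intelligent computing infrastructure.
DONG Bin (董斌, 1972- ), male, professor-level senior engineer at the Shanghai Branch of China Telecom Co., Ltd. His main research interests include artificial intelligence and computing power, and 5G service platforms.
XIAO Qing (肖晴, 1970- ), female, Ph.D., professor-level senior engineer at the Shanghai Branch of China Telecom Co., Ltd. Her main research interests include intelligent computing infrastructure architecture, public intelligent computing service platforms, and applications of large AI models.
JIAN Sheng (简晟, 1999- ), male, engineer at the Research Institute of China Telecom Co., Ltd. His main research interests include artificial intelligence, intelligent computing networks, and computing power networks.
FU Deji (傅德基, 1986- ), male, engineer at the Research Institute of China Telecom Co., Ltd. His main research interests include advanced-architecture chips for cloud computing, next-generation virtualization, and innovation mechanisms for long-term technology evolution.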
Received: 2024-12-01
Revised: 2025-01-21
Published in print: 2025-04-20
GAO Xiang, DONG Bin, XIAO Qing, et al. Challenges and development trends of collective communication libraries in intelligent computing scenarios[J]. Telecommunications Science, 2025, 41(04): 81-94. DOI: 10.11959/j.issn.1000-0801.2025048.
With the development of large artificial intelligence (AI) models, the demand for computing resources in distributed training has increased significantly. This leads to a substantial increase in the amount of data communicated between clusters, which poses severe challenges to collective communication libraries in intelligent computing scenarios. Focusing on the performance bottlenecks and business requirements of computing tasks in intelligent computing scenarios, the technical difficulties faced by current collective communication libraries were analyzed. The future development trends of these libraries were also envisioned, such as developing more efficient communication algorithms, implementing more flexible scheduling mechanisms, and enhancing cross-platform compatibility, with the aim of providing valuable references and support for research and practical applications in the field of intelligent computing.