1. Yunnan Branch of China United Network Communications Co., Ltd., Kunming 650206, Yunnan, China
2. ZTE Corporation, Kunming 650034, Yunnan, China
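ZHANG Yunyong (1976- ), male, PhD, is a professor-level senior engineer at the Yunnan Branch of China United Network Communications Co., Ltd. He is a national-level candidate of the Hundred, Thousand and Ten Thousand Talents Project, a recipient of the State Council Special Government Allowance, and an industrial innovation talent of the Yunnan "Xingdian Talent Program". His main research interests are the digital economy, artificial intelligence, information technology, and communication technology.
YAN Shuo (1987- ), male, PhD, is a senior engineer at the Yunnan Branch of China United Network Communications Co., Ltd. and director of the Science and Technology Innovation Management Office in the Construction and Development Department (Science and Technology Innovation Department). His main research interests are mathematical logic, formal methods, and artificial intelligence.
CHEN Yongming (1978- ), male, works at ZTE Corporation. His main research interests are network architecture, on-chip interconnect, and processor microarchitecture.
ZHANG Qiming (1970- ), male, works at ZTE Corporation. His main research interest is interconnect architecture for intelligent computing networks.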
Received: 2025-04-28
Revised: 2025-05-25
Published in print: 2025-08-20
ZHANG Yunyong, YAN Shuo, CHEN Yongming, et al. A survey on intelligent computing interconnection[J]. Telecommunications Science, 2025, 41(08): 22-32. DOI: 10.11959/j.issn.1000-0801.2025165.
As the parameter counts of large models surpass the trillion scale, intelligent computing interconnection faces technical challenges including ultra-large-scale networking, low-latency communication, and high-bandwidth synchronization. A multidimensional evaluation framework incorporating metrics such as throughput, latency, and scaling ratio was established, and the distinctive requirements of three major application scenarios, namely large model training, artificial intelligence (AI) inference, and edge computing, were analyzed. Through a comparison of solutions from leading technology companies, innovative practices such as the CLOS architecture and the Fat-Tree topology were summarized, with the discussion focused on key technologies including interconnection protocols, network topologies, and congestion control. Future directions such as open protocols and optoelectronic integration were also outlined. The findings indicate that continued innovation in intelligent computing interconnection technologies will provide critical infrastructure support for AI development.