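Shanghai Jiao Tong University, Shanghai 200240
YE Tong (1976- ), male, Ph.D., associate professor at Shanghai Jiao Tong University. His main research interests include data center optical networks, optical switching and optical interconnect architectures, and performance analysis of optical networks.
HU Weisheng (1964- ), male, Ph.D., professor at Shanghai Jiao Tong University. His main research interests include optical switching fabrics and all-optical communication networks.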
Received: 2025-03-18
Revised: 2025-04-09
Published in print: 2025-04-20
YE Tong, HU Weisheng. Overview of large-scale optical-electrical switching networks for artificial intelligent data center (AIDC)[J]. Telecommunications Science, 2025, 41(04): 32-43. DOI: 10.11959/j.issn.1000-0801.2025116.
As artificial intelligence data centers (AIDCs) scale toward the million-card level, traditional AIDC networks characterized by "data center optical interconnection (DCI) + electrical packet switching (EPS)" face growing challenges of high power consumption, high communication latency, and insufficient reliability. In recent years, industry has begun to explore photonic technologies to reduce the power consumption of AIDC networks and to improve their scalability, flexibility, and reliability. This paper reviews two classes of AIDC network architectures studied to date: "DCI + EPS + optical circuit switching (OCS)" and "DCI + fast optical switching (FOS)". Drawing on the practices of leading enterprises and related work at research institutions, it discusses the technical pathways, performance advantages, and open research issues of both architectures, and offers a reference for the design of future large-scale AIDC networks.