Astral 3.0: Design and Practice of High-Performance Network Infrastructure for Large-Scale MoE Training and Inference

WANG Yachen; XIA Yinben; WANG Zibo; CAO Peirui; WANG Zhibin

doi:10.11959/j.issn.1000-0801.DXKX260015

Updated: 2026-05-11

- Telecommunications Science (2026)
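- Author affiliations:
  1. Tencent Technology (Shenzhen) Co., Ltd., Shenzhen, Guangdong 518000
  2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu 210000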
- CLC: TP393
- Received: 2026-01-07; Revised: 2026-01-29; Accepted: 2026-04-09
WANG Yachen, XIA Yinben, WANG Zibo, et al. Astral 3.0: Design and Practice of High-Performance Network Infrastructure for Large-Scale MoE Training and Inference[J/OL]. Telecommunications Science, 2026. DOI: 10.11959/j.issn.1000-0801.DXKX260015.

Abstract

As large-model architectures evolve toward sparse Mixture-of-Experts (MoE) designs, communication overhead accounts for a significantly larger share of end-to-end latency in both training and inference, and communication performance has become a key factor constraining system performance. To address the challenges of large-scale MoE training and inference, namely heavy All-to-All communication pressure, sensitivity to bandwidth and latency, and sharply rising operational complexity, this paper proposes a high-performance network infrastructure solution based on hardware-software co-design. First, at the architecture level, the Astral 3.0 (星脉 3.0) network architecture uses Optical Shuffle technology to build a flattened two-layer single-rail network that is adapted to the All-to-All traffic characteristics of MoE, significantly improving communication performance and reducing networking cost. Second, at the communication software level, All-to-All communication kernels are optimized separately for the distinct traffic characteristics of each stage of training and inference. Using GPU-centric task dispatch and expert-granularity load balancing, the paper implements high-bandwidth kernels for training and the Prefill stage and low-latency kernels for the Decode stage, substantially reducing end-to-end latency. Finally, at the operations level, AI Agents are used to optimize network operations workflows, enabling proactive fault warning and intelligent interactive diagnosis, which safeguards the continuity of long-running training and the high availability of online services. Experimental results show that the solution effectively breaks the communication wall of MoE models and provides a unified, high-performance, and highly reliable system foundation for large-scale training and online serving of trillion-parameter models.
