
1. China Telecom Research Institute, Beijing 102209, China
2. China Telecom Research Institute, Guangzhou 510630, China
Received: 10 October 2025
Revised: 31 December 2025
Accepted: 11 May 2026
SUN Mengyu, TAN Yukai, Corresponding Author, et al. Prefill and Decode Disaggregation Deployment Optimization for Mixture-of-Experts Model in Distributed Inference Systems[J/OL]. Telecommunications Science, 2026. DOI: 10.11959/j.issn.1000-0801.DXKX250592.
By decoupling the inference process of the Mixture-of-Experts (MoE) model into a compute-intensive prefill phase and a memory-access-intensive decode phase, and deploying the two phases on separate distributed physical computing nodes, the inference system runs more efficiently and model inference efficiency improves. This paper describes the MoE inference process in detail. Based on the inference processes of the Attention layer and the MoE layer, it models the online inference system for both single inference tasks and batched inference tasks, and derives throughput from the computation and transmission latency of the prefill and decode phases, with the aim of balancing throughput between the two phases. A resource configuration and deployment mechanism based on the binary search algorithm is proposed to determine, for each phase, the computing resource ratio, the number of deployed instances, and the parallel strategy. The mechanism is validated on two mainstream types of computing nodes and compared against a non-PD-disaggregation baseline and current mainstream PD disaggregation inference optimization approaches. Experimental results show that, compared with the non-PD-disaggregation baseline, PD disaggregation inference achieves a throughput improvement of more than 3x, and it still outperforms the current mainstream approaches. In addition, the proposed mechanism simplifies manual configuration and tuning by finding near-optimal PD disaggregation deployment decisions across different input/output lengths, concurrency levels, and request rates. Compared with other feasible PD resource allocation decisions, it increases the average per-card throughput by 30-50%.
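The idea of using binary search to balance prefill and decode throughput can be illustrated with a minimal sketch. It is not the paper's mechanism, which also selects instance counts and parallel strategies; it only splits a fixed pool of cards between the two phases. The function name `balance_pd_split`, the two throughput callbacks, and the linear throughput models in the example are all hypothetical, and the sketch assumes each phase's throughput (in requests per second) is non-decreasing in the number of cards it receives.

```python
from typing import Callable, Tuple


def balance_pd_split(
    total_cards: int,
    prefill_tput: Callable[[int], float],
    decode_tput: Callable[[int], float],
) -> Tuple[int, int, float]:
    """Return (prefill_cards, decode_cards, system_throughput).

    Assumption: both callbacks are non-decreasing in the card count, so
    prefill_tput(p) - decode_tput(total_cards - p) is non-decreasing in p
    and its sign change can be located by binary search.
    """
    if total_cards < 2:
        raise ValueError("need at least one card per phase")

    def system(p: int) -> float:
        # The pipeline is limited by its slower phase.
        return min(prefill_tput(p), decode_tput(total_cards - p))

    # Smallest p at which prefill is no longer the bottleneck
    # (or total_cards - 1 if prefill is always the bottleneck).
    lo, hi = 1, total_cards - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefill_tput(mid) >= decode_tput(total_cards - mid):
            hi = mid
        else:
            lo = mid + 1

    # The optimum lies on one side of the crossover: lo - 1 or lo.
    candidates = [p for p in (lo - 1, lo) if 1 <= p <= total_cards - 1]
    best = max(candidates, key=system)
    return best, total_cards - best, system(best)


if __name__ == "__main__":
    # Hypothetical per-card rates, not measurements from the paper.
    result = balance_pd_split(
        16,
        prefill_tput=lambda p: 3000.0 * p,
        decode_tput=lambda d: 1000.0 * d,
    )
    print(result)  # (4, 12, 12000.0)
```

With these made-up rates, 4 prefill cards and 12 decode cards both sustain 12000 requests per second, so neither phase starves the other. In practice the callbacks would be replaced by a latency model or by profiling results for the target hardware.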