Prefill and Decode Disaggregation Deployment Optimization for Mixture-of-Experts Model in Distributed Inference Systems

SUN Mengyu; TAN Yukai; Corresponding Author; LU Gang; HUANG Zhilan; WANG Yasen; ZHU Zeya; LI Yiqing

doi:10.11959/j.issn.1000-0801.DXKX250592

Updated: 11 May 2026

- Telecommunications Science (2026)
- Author affiliations:
  1. China Telecom Research Institute, Beijing 102209, China
  2. China Telecom Research Institute, Guangzhou 510630, China
- Received: 10 October 2025; Revised: 31 December 2025; Accepted: 11 May 2026

SUN Mengyu, TAN Yukai, Corresponding Author, et al. Prefill and Decode Disaggregation Deployment Optimization for Mixture-of-Experts Model in Distributed Inference Systems[J/OL]. Telecommunications Science, 2026. DOI: 10.11959/j.issn.1000-0801.DXKX250592.

Abstract

Decoupling the inference process of a Mixture-of-Experts (MoE) model into a compute-intensive prefill phase and a memory-access-intensive decode phase, and deploying the two phases on separate distributed physical computing nodes, lets the inference system run more efficiently and improves model inference efficiency.

This paper describes the MoE model inference process in detail. Based on the inference processes of the Attention layer and the MoE layer, it models the online inference system for both single inference tasks and batched inference tasks. Throughput is obtained by calculating the computation and transmission latency of the prefill phase and the decode phase, with the aim of balancing throughput between the two phases. A resource configuration and strategy deployment mechanism based on binary search is proposed to determine, for each phase, the computing resource ratio, the number of deployed instances, and the parallel strategy.

The mechanism is validated on two mainstream types of computing nodes and compared against a baseline without prefill-decode (PD) disaggregation and against state-of-the-art disaggregated inference optimization approaches. Experimental results show that PD-disaggregated inference achieves a throughput improvement of more than 3x over the non-disaggregated baseline and still outperforms the state-of-the-art approaches. In addition, the proposed mechanism simplifies manual configuration and tuning, finding near-optimal PD disaggregation deployment decisions under different input/output lengths, concurrency levels, and request frequencies. Compared with other feasible PD resource allocation decisions, the average per-card throughput increases by 30-50%.
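The throughput-balancing idea in the abstract can be illustrated with a minimal sketch. It assumes each phase's throughput is a batch size divided by the sum of a computation latency and a transmission latency, and that throughput does not decrease as a phase receives more cards. The function names, the toy latency constants, and the single-variable search over card counts are hypothetical illustrations, not the paper's actual model; the mechanism described in the abstract also chooses instance counts and parallel strategies, which are omitted here.

```python
from typing import Callable, Tuple

BATCH_REQUESTS = 8  # hypothetical batch size, for illustration only


def toy_prefill_tput(cards: int) -> float:
    """Toy prefill throughput (requests/s): batch / (compute + transfer latency)."""
    compute_s = 2.0 / cards   # hypothetical: compute latency shrinks with more cards
    transfer_s = 0.1          # hypothetical: fixed KV-cache transmission latency
    return BATCH_REQUESTS / (compute_s + transfer_s)


def toy_decode_tput(cards: int) -> float:
    """Toy decode throughput (requests/s) under the same latency-based form."""
    compute_s = 6.0 / cards   # hypothetical: decode is slower per request
    return BATCH_REQUESTS / compute_s


def balance_pd_split(
    total_cards: int,
    prefill_tput: Callable[[int], float],
    decode_tput: Callable[[int], float],
) -> Tuple[int, int, float]:
    """Return (prefill_cards, decode_cards, system_throughput).

    Assumes both throughput functions are non-decreasing in their card count,
    so prefill_tput(p) - decode_tput(total_cards - p) is non-decreasing in p
    and binary search over p is valid.
    """
    if total_cards < 2:
        raise ValueError("need at least one card per phase")

    def system_tput(p: int) -> float:
        # The slower phase limits end-to-end throughput.
        return min(prefill_tput(p), decode_tput(total_cards - p))

    lo, hi = 1, total_cards - 1
    # Find the smallest p at which prefill is no longer the bottleneck.
    while lo < hi:
        mid = (lo + hi) // 2
        if prefill_tput(mid) >= decode_tput(total_cards - mid):
            hi = mid
        else:
            lo = mid + 1
    # The best split is at the crossing point or one card before it.
    candidates = [p for p in (lo - 1, lo) if 1 <= p <= total_cards - 1]
    best = max(candidates, key=system_tput)
    return best, total_cards - best, system_tput(best)


prefill_cards, decode_cards, tput = balance_pd_split(
    16, toy_prefill_tput, toy_decode_tput
)
print(prefill_cards, decode_cards, round(tput, 2))
```

Because the slower phase bounds end-to-end throughput, the search only needs to locate the point where the bottleneck switches from prefill to decode, which takes a logarithmic number of throughput evaluations instead of a scan over every split.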
