Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference

Cited by: 3
Authors
Yao, Jinghan [1]
Anthony, Quentin [1]
Shafi, Aamir [1]
Subramoni, Hari [1]
Panda, Dhabaleswar K. [1]
Affiliations
[1] Ohio State Univ Columbus, Dept Comp Sci & Engn, Columbus, OH 43210 USA
Source
PROCEEDINGS 2024 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM, IPDPS 2024 | 2024
Keywords
Mixture of experts; Parallel inference; Collective communication; Generative models; Distributed system
DOI
10.1109/IPDPS57955.2024.00086
CLC Number
TP31 [Computer Software]
Subject Classification Code
081202; 0835
Abstract
In the realm of large language models (LLMs) such as the Generative Pre-trained Transformer (GPT), the Mixture of Experts (MoE) paradigm has emerged as a powerful technique for enhancing model expressiveness and accuracy. However, deploying GPT MoE models for parallel inference on distributed systems presents significant challenges, primarily due to the extensive Alltoall communication required for expert routing and aggregation. This communication bottleneck exacerbates an already complex computational landscape and hinders efficient utilization of high-performance computing resources. In this paper, we propose ExFlow, a lightweight optimization technique that substantially accelerates the inference of these MoE models. We take a new perspective on alleviating the communication overhead by exploiting inter-layer expert affinity. Unlike previous methods, our solution can be applied directly to pre-trained MoE models without any fine-tuning or accuracy degradation. By introducing context-coherent expert parallelism on distributed systems, our ExFlow design delivers the same functionality with a single Alltoall communication, whereas previous methods require two Alltoalls. By carefully examining the conditional probability of tokens' routing across multiple layers, we show that pre-trained GPT MoE models implicitly exhibit a strong inter-layer expert affinity. We then design an efficient integer programming model to precisely capture this property and show that, by properly placing the experts on the corresponding GPUs, we can reduce tokens' cross-GPU routing latency by up to 67% across various hardware configurations and topologies. Our solution outperforms the state-of-the-art Deepspeed-MoE on GPT MoE models with 8 to 64 experts, achieving up to 2.2x higher inference throughput. To the best of our knowledge, this is the first work to leverage inter-layer expert affinity to accelerate the inference of GPT MoE models. We further provide a detailed study of how the model implicitly acquires this expert affinity at a very early training stage and how this affinity evolves and stabilizes during training.
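
The key idea described above, namely estimating the conditional probability that a token routed to expert i at one MoE layer is routed to expert j at the next layer, and then co-locating high-affinity experts on the same GPU, can be made concrete with a minimal sketch. This is an illustration only, not the authors' ExFlow implementation: it assumes top-1 gating and an offline routing trace, and the affinity_matrix and greedy_placement helpers are hypothetical names, with a simple greedy grouping standing in for the paper's integer-programming placement.

# Illustrative sketch only: NOT the authors' ExFlow code. Assumes top-1 gating
# and a recorded routing trace; greedy grouping replaces the paper's integer
# programming formulation purely for readability.
import numpy as np


def affinity_matrix(routing, num_experts):
    """Estimate P(expert j at layer l+1 | expert i at layer l), pooled over layers.

    routing: int array of shape (num_tokens, num_layers); routing[t, l] is the
    expert index chosen for token t at MoE layer l (top-1 gating assumed).
    """
    num_tokens, num_layers = routing.shape
    counts = np.zeros((num_experts, num_experts))
    for l in range(num_layers - 1):
        # Accumulate layer-(l) -> layer-(l+1) expert transitions over all tokens.
        np.add.at(counts, (routing[:, l], routing[:, l + 1]), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Normalize rows to conditional probabilities, guarding against empty rows.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


def greedy_placement(affinity, num_experts, num_gpus):
    """Group experts so that pairs a token is likely to visit in consecutive
    layers land on the same GPU, reducing cross-GPU routing of token states."""
    per_gpu = num_experts // num_gpus
    sym = affinity + affinity.T  # transition direction does not matter for placement
    unassigned = set(range(num_experts))
    placement = {}
    for gpu in range(num_gpus):
        pool = list(unassigned)
        # Seed with the expert carrying the largest remaining affinity mass.
        seed = max(pool, key=lambda e: sym[e, pool].sum())
        group = [seed]
        unassigned.remove(seed)
        while len(group) < per_gpu and unassigned:
            # Add the expert with the strongest affinity to the current group.
            nxt = max(unassigned, key=lambda e: sym[e, group].sum())
            group.append(nxt)
            unassigned.remove(nxt)
        for e in group:
            placement[e] = gpu
    return placement


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic routing trace: 4096 tokens, 12 MoE layers, 16 experts.
    trace = rng.integers(0, 16, size=(4096, 12))
    aff = affinity_matrix(trace, num_experts=16)
    print(greedy_placement(aff, num_experts=16, num_gpus=4))

On a real model, the trace would come from logging gate decisions during inference on a calibration set; the paper reports that this affinity is already visible early in training, which is what makes a static placement effective.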
Pages: 915-925
Page count: 11