Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Cited by: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Source
NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024 | 2025 / Vol. 15361
Keywords
Mixture of experts; Knowledge distillation; Language models
DOI
10.1007/978-981-97-9437-9_7
CLC Classification Code
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent optimization efforts in distilling large language models have primarily focused on loss functions and training methodologies, with limited attention given to structural improvements of student models. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required for modifying model structures. To address these issues, we introduce a novel method that integrates a sparse mixture of experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only bolsters the capabilities of the student model but also facilitates knowledge distillation using MoE without the necessity of continued pretraining. Experimental results indicate that our approach enhances the model's capabilities compared to dense model distillation, achieving superior performance across a multitude of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
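To make the architectural idea concrete, below is a minimal PyTorch sketch of the kind of student layer the abstract describes: a frozen pretrained linear projection extended with a sparse, top-k-routed set of LoRA experts, trained against a teacher with a standard distillation objective. All names and hyperparameters here (MoELoRALinear, kd_loss, num_experts, rank, top_k, the temperature and loss weighting) are illustrative assumptions, not the authors' implementation; the released code is at the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELoRALinear(nn.Module):
    """Frozen pretrained linear layer augmented with top-k routed LoRA experts (illustrative sketch)."""

    def __init__(self, base: nn.Linear, num_experts: int = 8, rank: int = 8, top_k: int = 2):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # the dense pretrained weight stays frozen
            p.requires_grad_(False)
        self.top_k = top_k
        self.router = nn.Linear(base.in_features, num_experts, bias=False)
        self.lora_A = nn.ModuleList(
            nn.Linear(base.in_features, rank, bias=False) for _ in range(num_experts)
        )
        self.lora_B = nn.ModuleList(
            nn.Linear(rank, base.out_features, bias=False) for _ in range(num_experts)
        )
        for b in self.lora_B:                 # standard LoRA init: B = 0, so the layer
            nn.init.zeros_(b.weight)          # initially behaves exactly like the base

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        gate = F.softmax(self.router(x), dim=-1)           # (..., num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)       # sparse top-k routing
        weights = weights / weights.sum(dim=-1, keepdim=True)
        for k in range(self.top_k):                        # naive dispatch loop, kept for clarity
            for e in range(len(self.lora_A)):
                mask = (idx[..., k] == e).unsqueeze(-1)
                if mask.any():
                    delta = self.lora_B[e](self.lora_A[e](x))
                    out = out + mask * weights[..., k : k + 1] * delta
        return out


def kd_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    """Vanilla distillation objective: softened KL to the teacher plus cross-entropy.
    Expects student_logits/teacher_logits of shape (N, vocab) and labels of shape (N,)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

Under these assumptions only the router and the LoRA matrices are trainable, which is what would allow an MoE-structured student to be distilled directly without continued pretraining of the dense backbone.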
Pages: 80-91
Number of pages: 12
Related Papers
50 records in total
  • [41] Boosting the Performance of Lightweight HAR Models with Attention and Knowledge Distillation. Agac, Sumeyye; Incel, Ozlem Durmaz. 2024 INTERNATIONAL CONFERENCE ON INTELLIGENT ENVIRONMENTS, IE 2024, 2024: 1-8
  • [42] Improving Multilingual Text-to-Speech with Mixture-of-Language-Experts and Accent Disentanglement. Wu, Jing; Chen, Ting; Chen, Minchuan; Hu, Wei; Wang, Shaojun; Xiao, Jing. INTERSPEECH 2024, 2024: 4968-4972
  • [43] Adaptive mixture-of-experts models for data glove interface with multiple users. Yoon, Jong-Won; Yang, Sung-Ihk; Cho, Sung-Bae. EXPERT SYSTEMS WITH APPLICATIONS, 2012, 39(05): 4898-4907
  • [44] New estimation in mixture of experts models using the Pearson type VII distribution. Yin, Junhui; Wu, Liucang; Lu, Hanchi; Dai, Lin. COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2020, 49(02): 472-483
  • [45] Cross-modal knowledge distillation for continuous sign language recognition. Gao, Liqing; Shi, Peng; Hu, Lianyu; Feng, Jichao; Zhu, Lei; Wan, Liang; Feng, Wei. NEURAL NETWORKS, 2024, 179
  • [46] Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition. Wang, Wenxuan; Ma, Guodong; Li, Yuke; Du, Binbin. INTERSPEECH 2023, 2023: 1389-1393
  • [47] Automatic Segmentation using Knowledge Distillation with Ensemble Models (ASKDEM). Buschiazzo, Anthony; Russell, Mason; Osteen, Philip; Uplinger, James. UNMANNED SYSTEMS TECHNOLOGY XXVI, 2024, 13055
  • [48] Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation. Han, Minglun; Chen, Feilong; Shi, Jing; Xu, Shuang; Xu, Bo. INTERSPEECH 2023, 2023: 1364-1368
  • [49] Mixture density knowledge distillation in super-resolution reconstruction of MRI medical images. Yu, Xiangchun; Zhou, Ningning; Zheng, Jian; Liang, Miaomiao; Qiu, Liujin; Xu, Qing. MEDICAL ENGINEERING AND PHYSICS, 2025, 139