Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Times Cited: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Source
NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024 | 2025, Vol. 15361
Keywords
Mixture of experts; Knowledge distillation; Language models
DOI
10.1007/978-981-97-9437-9_7
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent optimization efforts in distilling large language models have primarily focused on loss functions and training methodologies, with limited attention given to structural improvements of student models. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required for modifying model structures. To address these issues, we introduce a novel method that integrates a sparse mixture of experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only strengthens the student model but also enables knowledge distillation with an MoE student without requiring continued pretraining. Experimental results indicate that our approach outperforms dense-model distillation, achieving superior performance across a wide range of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
Pages: 80-91
Number of Pages: 12
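
The abstract above describes integrating a sparse MoE architecture with LoRA adapters so that a student model can be distilled without continued pretraining. Below is a minimal, hypothetical PyTorch sketch of that general idea, not the authors' released implementation: a frozen linear projection augmented with a top-k gated mixture of LoRA experts, together with a standard temperature-scaled KL distillation loss. All class names, ranks, expert counts, and the temperature are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter: x -> B(A(x)), with rank much smaller than d_model."""
    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.A = nn.Linear(d_model, rank, bias=False)
        self.B = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.B.weight)  # standard LoRA init: the adapter starts as a zero delta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.B(self.A(x))


class SparseMoELoRALayer(nn.Module):
    """A frozen dense projection plus a top-k gated mixture of LoRA experts."""
    def __init__(self, d_model: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)          # stands in for a frozen pretrained weight
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.experts = nn.ModuleList([LoRAExpert(d_model) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate_logits = self.router(x)                          # (..., num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)   # route each token to its top_k experts
        weights = F.softmax(weights, dim=-1)
        out = self.base(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                routed = (idx[..., slot] == e).unsqueeze(-1)  # tokens assigned to expert e in this slot
                out = out + routed * weights[..., slot:slot + 1] * expert(x)
        return out


def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Hinton-style soft-label KL term between teacher and student output distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

In this sketch only the router and the LoRA experts receive gradients, which mirrors the abstract's point that the MoE student can be trained by distillation without continued pretraining of the dense backbone; the per-expert loop is written for clarity rather than efficiency.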