Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Times Cited: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Source
NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024 | 2025, Vol. 15361
Keywords
Mixture of experts; Knowledge distillation; Language models
DOI
10.1007/978-981-97-9437-9_7
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent optimization efforts in distilling large language models have primarily focused on loss functions and training methodologies, with limited attention given to structural improvements of student models. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required for modifying model structures. To address these issues, we introduce a novel method that integrates a sparse mixture of experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only strengthens the student model but also enables knowledge distillation with an MoE student without requiring continued pretraining. Experimental results indicate that our approach outperforms dense-model distillation, achieving superior performance across a wide range of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
Pages: 80-91
Number of Pages: 12
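
The abstract above describes integrating a sparse MoE architecture with LoRA adapters so that a student model can be distilled without continued pretraining. Below is a minimal, hypothetical PyTorch sketch of that general idea, not the authors' released implementation: a frozen linear projection augmented with a top-k gated mixture of LoRA experts, together with a standard temperature-scaled KL distillation loss. All class names, ranks, expert counts, and the temperature are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter: x -> B(A(x)), with rank much smaller than d_model."""
    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.A = nn.Linear(d_model, rank, bias=False)
        self.B = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.B.weight)  # standard LoRA init: the adapter starts as a zero delta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.B(self.A(x))


class SparseMoELoRALayer(nn.Module):
    """A frozen dense projection plus a top-k gated mixture of LoRA experts."""
    def __init__(self, d_model: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)          # stands in for a frozen pretrained weight
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.experts = nn.ModuleList([LoRAExpert(d_model) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate_logits = self.router(x)                          # (..., num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)   # route each token to its top_k experts
        weights = F.softmax(weights, dim=-1)
        out = self.base(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                routed = (idx[..., slot] == e).unsqueeze(-1)  # tokens assigned to expert e in this slot
                out = out + routed * weights[..., slot:slot + 1] * expert(x)
        return out


def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Hinton-style soft-label KL term between teacher and student output distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

In this sketch only the router and the LoRA experts receive gradients, which mirrors the abstract's point that the MoE student can be trained by distillation without continued pretraining of the dense backbone; the per-expert loop is written for clarity rather than efficiency.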