Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Cited by: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Source
NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024 | 2025 / Vol. 15361
Keywords
Mixture of experts; Knowledge distillation; Language models
DOI
10.1007/978-981-97-9437-9_7
CLC Classification Code
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent optimization efforts in distilling large language models have primarily focused on loss functions and training methodologies, with limited attention given to structural improvements of student models. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required for modifying model structures. To address these issues, we introduce a novel method that integrates a sparse mixture of experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only bolsters the capabilities of the student model but also facilitates knowledge distillation using MoE without the necessity of continued pretraining. Experimental results indicate that our approach enhances the model's capabilities compared to dense model distillation, achieving superior performance across a multitude of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
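To make the architectural idea concrete, below is a minimal PyTorch sketch of the kind of student layer the abstract describes: a frozen pretrained linear projection extended with a sparse, top-k-routed set of LoRA experts, trained against a teacher with a standard distillation objective. All names and hyperparameters here (MoELoRALinear, kd_loss, num_experts, rank, top_k, the temperature and loss weighting) are illustrative assumptions, not the authors' implementation; the released code is at the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELoRALinear(nn.Module):
    """Frozen pretrained linear layer augmented with top-k routed LoRA experts (illustrative sketch)."""

    def __init__(self, base: nn.Linear, num_experts: int = 8, rank: int = 8, top_k: int = 2):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # the dense pretrained weight stays frozen
            p.requires_grad_(False)
        self.top_k = top_k
        self.router = nn.Linear(base.in_features, num_experts, bias=False)
        self.lora_A = nn.ModuleList(
            nn.Linear(base.in_features, rank, bias=False) for _ in range(num_experts)
        )
        self.lora_B = nn.ModuleList(
            nn.Linear(rank, base.out_features, bias=False) for _ in range(num_experts)
        )
        for b in self.lora_B:                 # standard LoRA init: B = 0, so the layer
            nn.init.zeros_(b.weight)          # initially behaves exactly like the base

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        gate = F.softmax(self.router(x), dim=-1)           # (..., num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)       # sparse top-k routing
        weights = weights / weights.sum(dim=-1, keepdim=True)
        for k in range(self.top_k):                        # naive dispatch loop, kept for clarity
            for e in range(len(self.lora_A)):
                mask = (idx[..., k] == e).unsqueeze(-1)
                if mask.any():
                    delta = self.lora_B[e](self.lora_A[e](x))
                    out = out + mask * weights[..., k : k + 1] * delta
        return out


def kd_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    """Vanilla distillation objective: softened KL to the teacher plus cross-entropy.
    Expects student_logits/teacher_logits of shape (N, vocab) and labels of shape (N,)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

Under these assumptions only the router and the LoRA matrices are trainable, which is what would allow an MoE-structured student to be distilled directly without continued pretraining of the dense backbone.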
Pages: 80-91
Number of pages: 12
Related Papers
50 records in total
  • [41] Boosting the Performance of Lightweight HAR Models with Attention and Knowledge Distillation. Agac, Sumeyye; Incel, Ozlem Durmaz. 2024 INTERNATIONAL CONFERENCE ON INTELLIGENT ENVIRONMENTS, IE 2024, 2024: 1-8
  • [42] Improving Multilingual Text-to-Speech with Mixture-of-Language-Experts and Accent Disentanglement. Wu, Jing; Chen, Ting; Chen, Minchuan; Hu, Wei; Wang, Shaojun; Xiao, Jing. INTERSPEECH 2024, 2024: 4968-4972
  • [43] Adaptive mixture-of-experts models for data glove interface with multiple users. Yoon, Jong-Won; Yang, Sung-Ihk; Cho, Sung-Bae. EXPERT SYSTEMS WITH APPLICATIONS, 2012, 39(05): 4898-4907
  • [44] New estimation in mixture of experts models using the Pearson type VII distribution. Yin, Junhui; Wu, Liucang; Lu, Hanchi; Dai, Lin. COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2020, 49(02): 472-483
  • [45] Cross-modal knowledge distillation for continuous sign language recognition. Gao, Liqing; Shi, Peng; Hu, Lianyu; Feng, Jichao; Zhu, Lei; Wan, Liang; Feng, Wei. NEURAL NETWORKS, 2024, 179
  • [46] Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition. Wang, Wenxuan; Ma, Guodong; Li, Yuke; Du, Binbin. INTERSPEECH 2023, 2023: 1389-1393
  • [47] Automatic Segmentation using Knowledge Distillation with Ensemble Models (ASKDEM). Buschiazzo, Anthony; Russell, Mason; Osteen, Philip; Uplinger, James. UNMANNED SYSTEMS TECHNOLOGY XXVI, 2024, 13055
  • [48] Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation. Han, Minglun; Chen, Feilong; Shi, Jing; Xu, Shuang; Xu, Bo. INTERSPEECH 2023, 2023: 1364-1368
  • [49] Mixture density knowledge distillation in super-resolution reconstruction of MRI medical images. Yu, Xiangchun; Zhou, Ningning; Zheng, Jian; Liang, Miaomiao; Qiu, Liujin; Xu, Qing. MEDICAL ENGINEERING AND PHYSICS, 2025, 139