Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Cited by: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Source
NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024 | 2025, Vol. 15361
Keywords
Mixture of experts; Knowledge distillation; Language models
DOI
10.1007/978-981-97-9437-9_7
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent optimization efforts in distilling large language models have primarily focused on loss functions and training methodologies, with limited attention given to structural improvements of student models. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required for modifying model structures. To address these issues, we introduce a novel method that integrates a sparse mixture of experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only bolsters the capabilities of the student model but also facilitates knowledge distillation using MoE without the necessity of continued pretraining. Experimental results indicate that our approach enhances the model's capabilities compared to dense model distillation, achieving superior performance across a multitude of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
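The abstract describes the method only at a high level: attach LoRA-based experts to the student, route tokens sparsely among them, and train with a distillation objective, so the added MoE capacity comes almost entirely from small trainable adapters. The following is a minimal, illustrative PyTorch sketch of that idea, not the authors' implementation (their released code is at the GitHub link above). The names LoRAExpert, MoELoRALayer, and kd_loss, as well as all hyperparameters (rank, number of experts, top-k, temperature), are assumptions made for this sketch.

# Illustrative sketch only (assumed names and hyperparameters), not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """One expert = a low-rank update x @ A @ B added to a frozen base projection."""
    def __init__(self, d_model: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_model, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d_model))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x @ self.A @ self.B) * self.scale

class MoELoRALayer(nn.Module):
    """Frozen dense projection plus top-k routed LoRA experts (sparse MoE)."""
    def __init__(self, d_model: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)          # stands in for the frozen FFN
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(LoRAExpert(d_model) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Route every token to its top-k experts.
        gate = self.router(x)                            # (B, S, n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)     # (B, S, top_k)
        weights = weights.softmax(dim=-1)
        out = self.base(x)
        # For clarity every expert is evaluated and masked; a real implementation
        # would dispatch only the tokens routed to each expert.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)  # tokens assigned to expert e
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out

def kd_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    """Temperature-scaled soft-label KL term blended with hard-label cross-entropy."""
    s = student_logits.flatten(0, -2)                    # (tokens, vocab)
    t = teacher_logits.flatten(0, -2)
    soft = F.kl_div(F.log_softmax(s / T, dim=-1),
                    F.softmax(t / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(s, labels.flatten())
    return alpha * soft + (1 - alpha) * hard

# Example usage with toy shapes:
# x = torch.randn(2, 16, 512)
# y = MoELoRALayer(d_model=512)(x)                       # (2, 16, 512)

In this sketch only the router and the low-rank matrices are trainable while the base projection stays frozen, which is one plausible reading of the abstract's claim that MoE-based distillation works without continued pretraining; the paper's actual expert placement and routing scheme may differ.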
Pages: 80-91
Number of pages: 12