Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Cited by: 0
Authors
Xu, Haiyang [1]
Liu, Haoxiang [2]
Gong, Wei [1]
Wang, Hai [3]
Deng, Xianjun [4]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Source
NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024 | 2025, Vol. 15361
Keywords
Mixture of experts; Knowledge distillation; Language models
DOI
10.1007/978-981-97-9437-9_7
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent efforts to optimize the distillation of large language models have focused primarily on loss functions and training methodologies, with limited attention given to structural improvements of the student model. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required to modify model structures. To address these issues, we introduce a novel method that integrates a sparse mixture of experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only bolsters the capabilities of the student model but also enables knowledge distillation with an MoE student without the need for continued pretraining. Experimental results indicate that our approach enhances the student model's capabilities compared with dense-model distillation, achieving superior performance across a wide range of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
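The authors' code is to be released at the URL above; as a rough illustration only (not the paper's implementation), the PyTorch sketch below shows one plausible way to combine the two ingredients described in the abstract: LoRA adapters acting as sparsely routed MoE experts on top of a frozen pretrained linear layer, trained with a standard soft-label distillation loss. All names (LoRAExpert, MoELoRALinear, kd_loss) and hyperparameters are illustrative assumptions.

# Minimal sketch, assuming a frozen pretrained linear layer is augmented
# with top-k routed LoRA experts and trained with a soft-label KD loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter: output = (alpha / r) * x @ A^T @ B^T."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x):                        # x: (..., d_in)
        return self.scale * (x @ self.A.T @ self.B.T)


class MoELoRALinear(nn.Module):
    """Frozen base linear layer plus top-k routed LoRA experts."""
    def __init__(self, base_linear, n_experts=4, top_k=2, r=8):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():         # keep pretrained weights frozen
            p.requires_grad = False
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.experts = nn.ModuleList(
            [LoRAExpert(d_in, d_out, r=r) for _ in range(n_experts)])
        self.router = nn.Linear(d_in, n_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: (batch, seq, d_in)
        out = self.base(x)
        gate = self.router(x)                    # (batch, seq, n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        for k in range(self.top_k):              # add each selected expert's output
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out


def kd_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    """Soft-label KL distillation mixed with hard-label cross-entropy."""
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits.flatten(0, -2), labels.flatten())
    return lam * kl + (1.0 - lam) * ce

In this sketch only the routers and LoRA parameters are trainable, which reflects the abstract's point that the MoE student can be distilled without continued pretraining of the base weights; the exact routing scheme, expert count, and loss weighting used in the paper may differ.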
Pages: 80-91
Page count: 12