Adaptive Gating in Mixture-of-Experts based Language Models

Cited by: 0
Authors
Li, Jiamin [1 ]
Su, Qiang [1 ]
Yang, Yitao [2 ]
Jiang, Yimin
Wang, Cong [1 ]
Xu, Hong [2 ]
Affiliations
[1] City University of Hong Kong, Hong Kong, People's Republic of China
[2] Chinese University of Hong Kong, Hong Kong, People's Republic of China
Source
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), 2023
Keywords
None listed
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Large language models, such as OpenAI's ChatGPT, have demonstrated exceptional language understanding capabilities across a wide range of NLP tasks. Sparsely activated mixture-of-experts (MoE) has emerged as a promising solution for scaling models while keeping the number of computational operations constant. Existing MoE models adopt a fixed gating network in which every token is computed by the same number of experts. This contradicts our intuition that the tokens in a sequence vary in linguistic complexity and consequently warrant different computational costs; prior research has said little about this trade-off between per-token computation and model performance. This paper introduces adaptive gating in MoE, a flexible training strategy that allows tokens to be processed by a variable number of experts based on the expert probability distribution. The proposed framework preserves sparsity while improving training efficiency. Additionally, curriculum learning is leveraged to further reduce training time. Extensive experiments on diverse NLP tasks show that adaptive gating reduces training time by up to 22.5% while maintaining inference quality. Moreover, we conduct a comprehensive analysis of the routing decisions and present our insights on how adaptive gating behaves.
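The abstract's core mechanism, routing each token to a variable number of experts based on the gate's probability distribution, can be sketched in a few lines. Below is a minimal illustration assuming a simple confidence rule: a token uses only its top-1 expert when that expert's gate probability clears a threshold, and its top-2 experts otherwise. The function name adaptive_gate, the threshold tau, and this exact rule are illustrative assumptions, not the authors' precise formulation.

```python
# Minimal sketch of adaptive gating (assumed confidence-threshold rule):
# tokens whose top-1 gate probability is at least `tau` use one expert,
# all others use two. `adaptive_gate` and `tau` are illustrative names.
import torch
import torch.nn.functional as F

def adaptive_gate(x: torch.Tensor, w_gate: torch.Tensor, tau: float = 0.5):
    """Return (expert ids, mixing weights) for each token.

    x:      (num_tokens, d_model) token representations
    w_gate: (d_model, num_experts) gating network weights
    tau:    assumed confidence threshold on the top-1 probability
    """
    probs = F.softmax(x @ w_gate, dim=-1)        # (T, E) expert distribution
    top_p, top_i = probs.topk(2, dim=-1)         # two most probable experts
    confident = top_p[:, 0] >= tau               # tokens routed to one expert
    weights = top_p.clone()
    weights[confident, 1] = 0.0                  # drop the 2nd expert's share
    weights = weights / weights.sum(-1, keepdim=True)  # renormalize per token
    return top_i, weights

# Toy usage: 4 tokens, 8-dim embeddings, 4 experts.
torch.manual_seed(0)
tokens = torch.randn(4, 8)
gate_w = torch.randn(8, 4)
ids, mix = adaptive_gate(tokens, gate_w)
print(ids)   # chosen expert indices per token
print(mix)   # mixing weights; second column is 0 for confident tokens
```

In the full framework, the mixing weights would combine the selected experts' outputs, and the threshold governs the trade-off the abstract describes between per-token computation and model quality.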
Pages: 3577-3587
Page count: 11