HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts via HyperNetwork

Cited by: 0

Authors
Do, Giang [2 ]
Le, Khiem [3 ]
Pham, Quang [1 ]
TrungTin Nguyen [4 ]
Thanh-Nam Doan
Nguyen, Binh T. [5 ]
Liu, Chenghao [6 ]
Ramasamy, Savitha [1 ]
Li, Xiaoli [1 ]
Hoi, Steven [7 ]
Affiliations
[1] Institute for Infocomm Research (I2R), A*STAR, Singapore
[2] University of Tennessee at Chattanooga, Chattanooga, TN, USA
[3] VinUniversity, Hanoi, Vietnam
[4] Université Grenoble Alpes, CNRS, Inria, Grenoble INP, LJK, 38000 Grenoble, France
[5] AISIA Lab, University of Science, Vietnam National University Ho Chi Minh City, Ho Chi Minh City, Vietnam
[6] Salesforce Research Asia, Beijing, China
[7] Singapore Management University, Singapore
Source
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), 2023
Funding
National Research Foundation, Singapore
Keywords
None listed
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
By routing input tokens to only a few split experts, Sparse Mixture-of-Experts has enabled efficient training of large language models. Recent findings suggest that fixing the routers can achieve competitive performance by alleviating the collapsing problem, where all experts eventually learn similar representations. However, this strategy has two key limitations: (i) the policy derived from random routers might be suboptimal, and (ii) it requires extensive resources during training and evaluation, leading to limited efficiency gains. This work introduces HyperRouter, which dynamically generates the router's parameters through a fixed hypernetwork and trainable embeddings to achieve a balance between training the routers and freezing them to learn an improved routing policy. Extensive experiments across a wide range of tasks demonstrate the superior performance and efficiency gains of HyperRouter compared to existing routing methods. Our implementation is publicly available at https://github.com/giangdip2410/HyperRouter.
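To make the mechanism concrete, below is a minimal PyTorch sketch of the idea described in the abstract. It is not the authors' implementation (that is available at the linked repository), and all names and sizes (HyperRouterSketch, emb_dim, d_model, num_experts) are illustrative assumptions: a frozen hypernetwork maps a small trainable embedding to the router's weight matrix, so gradient updates reach only the embedding while the router weights are still generated dynamically.

```python
# Minimal sketch of a hypernetwork-generated router, assuming a standard
# softmax/top-k MoE routing setup. Illustrative only; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HyperRouterSketch(nn.Module):
    """Router whose weights come from a frozen hypernetwork applied to a
    small trainable embedding; only the embedding receives gradients."""

    def __init__(self, d_model: int, num_experts: int, emb_dim: int = 64):
        super().__init__()
        # Trainable embedding that conditions the hypernetwork.
        self.router_emb = nn.Parameter(torch.randn(emb_dim))
        # Fixed hypernetwork mapping the embedding to the router's
        # parameters (a num_experts x d_model matrix); weights are frozen.
        self.hypernet = nn.Linear(emb_dim, num_experts * d_model)
        for p in self.hypernet.parameters():
            p.requires_grad_(False)
        self.num_experts = num_experts
        self.d_model = d_model

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Generate the router's weight matrix on the fly from the embedding.
        router_weight = self.hypernet(self.router_emb).view(
            self.num_experts, self.d_model
        )
        # Standard routing logits: one score per expert for each token.
        logits = tokens @ router_weight.t()  # (num_tokens, num_experts)
        return F.softmax(logits, dim=-1)


# Usage: score a batch of token representations and pick the top-2 experts.
router = HyperRouterSketch(d_model=512, num_experts=8)
probs = router(torch.randn(4, 512))
top2_scores, top2_experts = probs.topk(2, dim=-1)
```

Under these assumptions, the trainable routing parameters shrink from a full num_experts x d_model matrix to a single emb_dim-sized embedding, which loosely illustrates the balance the abstract describes between fully trained and fully frozen routers.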
Pages: 5754-5765
Number of pages: 12