Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing

Cited by: 1
|
Authors
Buchel, Julian [1 ]
Vasilopoulos, Athanasios [1 ]
Simon, William Andrew [1 ]
Boybat, Irem [1 ]
Tsai, Hsinyu [2 ]
Burr, Geoffrey W. [2 ]
Castro, Hernan [3 ]
Filipiak, Bill [4 ]
Le Gallo, Manuel [1 ]
Rahimi, Abbas [1 ]
Narayanan, Vijay [5 ]
Sebastian, Abu [1 ]
Affiliations
[1] IBM Res Europe, Ruschlikon, Switzerland
[2] IBM Res Almaden, San Jose, CA USA
[3] Micron Technol, Folsom, CA USA
[4] Micron Technol, Novi, MI USA
[5] IBM Thomas J Watson Res Ctr, Yorktown Hts, NY USA
Source
NATURE COMPUTATIONAL SCIENCE | 2025, Vol. 5, Issue 1
Keywords
MEMRISTOR; CHIP;
DOI
10.1038/s43588-024-00753-x
CLC number
TP39 [Computer applications]
Discipline codes
081203; 0835
Abstract
Large language models (LLMs), with their remarkable generative capabilities, have greatly impacted a range of fields, but they face scalability challenges due to their large parameter counts, which make training and inference costly. The trend toward ever-larger models exacerbates these challenges, particularly in terms of memory footprint, latency and energy consumption. Here we explore the deployment of mixture-of-experts (MoE) networks, which use conditional computing to keep computational demands low despite having many parameters, on three-dimensional (3D) non-volatile memory (NVM)-based analog in-memory computing (AIMC) hardware. This hardware, built from stacked NVM devices arranged in crossbar arrays, alleviates the parameter-fetching bottleneck that models deployed on conventional von Neumann architectures typically face. By simulating the deployment of MoEs on an abstract 3D AIMC system, we demonstrate that, owing to their conditional compute mechanism, MoEs are inherently better suited to this hardware than conventional, dense model architectures. Our findings suggest that MoEs, in conjunction with emerging 3D NVM-based AIMC, can substantially reduce the inference costs of state-of-the-art LLMs, making them more accessible and energy-efficient.
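The conditional-compute mechanism the abstract refers to can be illustrated with a minimal top-k routing sketch. This is not the paper's implementation; the router, expert shapes and gating scheme below are illustrative assumptions, chosen only to show why an MoE layer touches a small fraction of its parameters per token:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 8, 4, 2

# Hypothetical weights: one router matrix plus one small linear "expert" each.
W_router = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """Route one token vector to its top-k experts.

    Only the selected experts' weight matrices are read, which is why an
    MoE fetches far fewer parameters per token than a dense layer with
    the same total parameter count.
    """
    logits = x @ W_router
    idx = np.argsort(logits)[-top_k:]   # indices of the top-k experts
    gates = np.exp(logits[idx])
    gates /= gates.sum()                # softmax over the selected experts only
    return sum(g * (x @ experts[i]) for g, i in zip(gates, idx))

token = rng.standard_normal(d_model)
out = moe_layer(token)
print(out.shape)  # (8,)
```

With `top_k = 2` of four experts, each token reads half of the expert weights; on 3D AIMC hardware, where the weights are stationary in the crossbar arrays, the unselected experts' tiles can simply stay idle.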
Pages: 13-26 (22 pages)
Related papers
50 records
  • [1] Efficient large language model with analog in-memory computing
    Subramoney, Anand
    NATURE COMPUTATIONAL SCIENCE, 2025, 5 (01): 5 - 6
  • [2] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
    Du, Nan
    Huang, Yanping
    Dai, Andrew M.
    Tong, Simon
    Lepikhin, Dmitry
    Xu, Yuanzhong
    Krikun, Maxim
    Zhou, Yanqi
    Yu, Adams Wei
    Firat, Orhan
    Zoph, Barret
    Fedus, Liam
    Bosma, Maarten
    Zhou, Zongwei
    Wang, Tao
    Wang, Yu Emma
    Webster, Kellie
    Pellat, Marie
    Robinson, Kevin
    Meier-Hellstern, Kathleen
    Duke, Toju
    Dixon, Lucas
    Zhang, Kun
    Le, Quoc V.
    Wu, Yonghui
    Chen, Zhifeng
    Cui, Claire
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022
  • [3] 3D Ferrimagnetic Device for Multi-Bit Storage and Efficient In-Memory Computing
    Zhang, Zhizhong
    Zheng, Zhenyi
    Zhang, Yue
    Sun, Jinyi
    Lin, Kelian
    Zhang, Kun
    Feng, Xueqiang
    Chen, Lei
    Wang, Jinkai
    Wang, Guanda
    Du, Yinchang
    Zhang, Youguang
    Bournel, Arnaud
    Amiri, Pedram Khalili
    Zhao, Weisheng
    IEEE ELECTRON DEVICE LETTERS, 2021, 42 (02) : 152 - 155
  • [4] Scaling Vision-Language Models with Sparse Mixture of Experts
    Shen, Sheng
    Yao, Zhewei
    Li, Chunyuan
    Darrell, Trevor
    Keutzer, Kurt
    He, Yuxiong
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 11329 - 11344
  • [5] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
    Lu, Xudong
    Liu, Qi
    Xu, Yuhui
    Zhou, Aojun
    Huang, Siyuan
    Zhang, Bo
    Yan, Junchi
    Li, Hongsheng
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 6159 - 6172
  • [6] Ultrathin Nitride Ferroic Memory with Large ON/OFF Ratios for Analog In-memory Computing
    Wang, Ding
    Wang, Ping
    Mondal, Shubham
    Hu, Mingtao
    Wu, Yuanpeng
    Ma, Tao
    Mi, Zetian
    ADVANCED MATERIALS, 2023, 35 (20)
  • [7] A 3D Memristor Architecture for In-Memory Computing Demonstrated with SHA3
    Aljafar, Muayad J.
    Joshi, Rasika
    Acken, John M.
    INTERNATIONAL JOURNAL OF UNCONVENTIONAL COMPUTING, 2024, 19 (2-3) : 93 - 121
  • [8] CoMIC: Complementary Memristor based in-memory computing in 3D architecture
    Lalchhandama, F.
    Datta, Kamalika
    Chakraborty, Sandip
    Drechsler, Rolf
    Sengupta, Indranil
    JOURNAL OF SYSTEMS ARCHITECTURE, 2022, 126
  • [9] 3D AND-type NVM for In-Memory Computing of Artificial Intelligence
    Lue, Hang-Ting
    Wang, Keh-Chung
    Lu, Chih-Yuan
    2018 14TH IEEE INTERNATIONAL CONFERENCE ON SOLID-STATE AND INTEGRATED CIRCUIT TECHNOLOGY (ICSICT), 2018, : 717 - 718