Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing

Cited by: 1
Authors
Buchel, Julian [1 ]
Vasilopoulos, Athanasios [1 ]
Simon, William Andrew [1 ]
Boybat, Irem [1 ]
Tsai, Hsinyu [2 ]
Burr, Geoffrey W. [2 ]
Castro, Hernan [3 ]
Filipiak, Bill [4 ]
Le Gallo, Manuel [1 ]
Rahimi, Abbas [1 ]
Narayanan, Vijay [5 ]
Sebastian, Abu [1 ]
Affiliations
[1] IBM Res Europe, Ruschlikon, Switzerland
[2] IBM Res Almaden, San Jose, CA USA
[3] Micron Technol, Folsom, CA USA
[4] Micron Technol, Novi, MI USA
[5] IBM Thomas J Watson Res Ctr, Yorktown Hts, NY USA
Source
NATURE COMPUTATIONAL SCIENCE, 2025, Vol. 5, Issue 1
Keywords
MEMRISTOR; CHIP;
DOI
10.1038/s43588-024-00753-x
CLC number
TP39 [Computer applications];
Subject classification codes
081203; 0835;
Abstract
Large language models (LLMs), with their remarkable generative capabilities, have greatly impacted a range of fields, but they face scalability challenges due to their large parameter counts, which result in high costs for training and inference. The trend of increasing model sizes is exacerbating these challenges, particularly in terms of memory footprint, latency and energy consumption. Here we explore the deployment of mixture-of-experts (MoE) networks, that is, networks that use conditional computing to keep computational demands low despite having many parameters, on three-dimensional (3D) non-volatile memory (NVM)-based analog in-memory computing (AIMC) hardware. When combined with the MoE architecture, this hardware, which utilizes stacked NVM devices arranged in a crossbar array, offers a solution to the parameter-fetching bottleneck typical of traditional models deployed on conventional von Neumann architectures. By simulating the deployment of MoEs on an abstract 3D AIMC system, we demonstrate that, owing to their conditional compute mechanism, MoEs are inherently better suited to this hardware than conventional, dense model architectures. Our findings suggest that MoEs, in conjunction with emerging 3D NVM-based AIMC, can substantially reduce the inference costs of state-of-the-art LLMs, making them more accessible and energy-efficient.
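The conditional-compute mechanism the abstract refers to is top-k expert routing: a small gating network picks a few experts per token, so only a fraction of the model's parameters is touched in any forward pass. The sketch below is a minimal NumPy illustration of that routing idea only; the layer sizes, top-k value and function names are illustrative assumptions and do not reflect the authors' implementation or the 3D AIMC mapping studied in the paper.

    # Minimal sketch of top-k mixture-of-experts routing (illustrative only;
    # shapes, names and top_k are assumptions, not the paper's code).
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_experts, top_k = 64, 8, 2
    experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
    gate_w = rng.standard_normal((d_model, n_experts)) * 0.02

    def moe_layer(x: np.ndarray) -> np.ndarray:
        """Route each token in x (shape [tokens, d_model]) to its top-k experts."""
        logits = x @ gate_w                                # [tokens, n_experts]
        top = np.argsort(logits, axis=-1)[:, -top_k:]      # indices of the chosen experts
        sel = np.take_along_axis(logits, top, axis=-1)     # softmax over selected logits only
        weights = np.exp(sel - sel.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out = np.zeros_like(x)
        for t in range(x.shape[0]):                        # per-token dispatch
            for slot in range(top_k):
                e = top[t, slot]
                out[t] += weights[t, slot] * (x[t] @ experts[e])
        return out

    tokens = rng.standard_normal((4, d_model))
    print(moe_layer(tokens).shape)  # (4, 64): only 2 of the 8 expert matrices are used per token

Because only top_k of the n_experts weight matrices are read per token, the weight-fetch traffic scales with the active experts rather than the total parameter count, which is the property the paper exploits when weights are held stationary in 3D NVM crossbar arrays.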
Pages: 13-26
Number of pages: 22
Related papers
50 records in total
  • [31] Exploring the Feasibility of Using 3-D XPoint as an In-Memory Computing Accelerator. Zabihi, Masoud; Resch, Salonik; Cilasun, Husrev; Chowdhury, Zamshed I.; Zhao, Zhengyang; Karpuzcu, Ulya R.; Wang, Jian-Ping; Sapatnekar, Sachin S. IEEE JOURNAL ON EXPLORATORY SOLID-STATE COMPUTATIONAL DEVICES AND CIRCUITS, 2021, 7(2): 88-96.
  • [32] Anisotropic scaling for 3D topological models. Rufo, S.; Griffith, M. A. R.; Lopes, Nei; Continentino, Mucio A. SCIENTIFIC REPORTS, 2021, 11(1).
  • [33] A Heterogeneous Platform for 3D NAND-Based In-Memory Hyperdimensional Computing Engine for Genome Sequencing Applications. Hsu, Po-Kai; Garg, Vaidehi; Lu, Anni; Yu, Shimeng. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS, 2024, 71(4): 1628-1637.
  • [34] Anisotropic scaling for 3D topological models. Rufo, S.; Griffith, M. A. R.; Lopes, Nei; Continentino, Mucio A. SCIENTIFIC REPORTS, 2021, 11.
  • [35] Overcoming language barriers via machine translation with sparse Mixture-of-Experts fusion of large language models. Zhu, Shaolin; Jian, Dong; Xiong, Deyi. INFORMATION PROCESSING & MANAGEMENT, 2025, 62(3).
  • [36] 2D Molecular Ferroelectric with Large Out-of-plane Polarization for In-Memory Computing. Yao, Jie; Feng, Zi-Jie; Hu, Zhenliang; Xiong, Yu-An; Pan, Qiang; Du, Guo-Wei; Ji, Hao-Ran; Sha, Tai-Ting; Lu, Junpeng; You, Yu-Meng. ADVANCED FUNCTIONAL MATERIALS, 2024, 34(22).
  • [37] Discovering the In-Memory Kernels of 3D Dot-Product Engines. Rashed, Muhammad Rashedul Haq; Jha, Sumit Kumar; Ewetz, Rickard. 2023 28TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE (ASP-DAC), 2023: 240-245.
  • [38] GP3D: 3D NAND Based In-Memory Graph Processing Accelerator. Shim, Wonbo; Yu, Shimeng. IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2022, 12(2): 500-507.
  • [39] Efficient Ray Tracing of Large 3D Scenes for Mobile Distributed Computing Environments. Seo, Woong; Park, Sanghun; Ihm, Insung. SENSORS, 2022, 22(2).
  • [40] ICE: An Intelligent Cognition Engine with 3D NAND-based In-Memory Computing for Vector Similarity Search Acceleration. Hu, Han-Wen; Wang, Wei-Chen; Chang, Yuan-Hao; Lee, Yung-Chun; Lin, Bo-Rong; Wang, Huai-Mu; Lin, Yen-Po; Huang, Yu-Ming; Lee, Chong-Ying; Su, Tzu-Hsiang; Hsieh, Chih-Chang; Hu, Chia-Ming; Lai, Yi-Ting; Chen, Chung-Kuang; Chen, Han-Sung; Li, Hsiang-Pang; Kuo, Tei-Wei; Chang, Meng-Fan; Wang, Keh-Chung; Hung, Chun-Hsiung; Lu, Chih-Yuan. 2022 55TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO), 2022: 763-783.