Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing

Cited by: 1
|
Authors
Buchel, Julian [1 ]
Vasilopoulos, Athanasios [1 ]
Simon, William Andrew [1 ]
Boybat, Irem [1 ]
Tsai, Hsinyu [2 ]
Burr, Geoffrey W. [2 ]
Castro, Hernan [3 ]
Filipiak, Bill [4 ]
Le Gallo, Manuel [1 ]
Rahimi, Abbas [1 ]
Narayanan, Vijay [5 ]
Sebastian, Abu [1 ]
Affiliations
[1] IBM Res Europe, Ruschlikon, Switzerland
[2] IBM Res Almaden, San Jose, CA USA
[3] Micron Technol, Folsom, CA USA
[4] Micron Technol, Novi, MI USA
[5] IBM Thomas J Watson Res Ctr, Yorktown Hts, NY USA
Source
NATURE COMPUTATIONAL SCIENCE | 2025, Vol. 5, Issue 1
Keywords
MEMRISTOR; CHIP;
DOI
10.1038/s43588-024-00753-x
CLC number
TP39 [Computer applications]
Discipline codes
081203; 0835
Abstract
Large language models (LLMs), with their remarkable generative capabilities, have greatly impacted a range of fields, but they face scalability challenges due to their large parameter counts, which make training and inference costly. The trend toward ever-larger models exacerbates these challenges, particularly in terms of memory footprint, latency and energy consumption. Here we explore the deployment of mixture-of-experts (MoE) networks, which use conditional computing to keep computational demands low despite having many parameters, on three-dimensional (3D) non-volatile memory (NVM)-based analog in-memory computing (AIMC) hardware. This hardware, built from stacked NVM devices arranged in crossbar arrays, alleviates the parameter-fetching bottleneck that models deployed on conventional von Neumann architectures typically face. By simulating the deployment of MoEs on an abstract 3D AIMC system, we demonstrate that, owing to their conditional compute mechanism, MoEs are inherently better suited to this hardware than conventional, dense model architectures. Our findings suggest that MoEs, in conjunction with emerging 3D NVM-based AIMC, can substantially reduce the inference costs of state-of-the-art LLMs, making them more accessible and energy-efficient.
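The conditional-compute mechanism the abstract refers to can be illustrated with a minimal top-k routing sketch. This is not the paper's implementation; the router, expert shapes and gating scheme below are illustrative assumptions, chosen only to show why an MoE layer touches a small fraction of its parameters per token:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 8, 4, 2

# Hypothetical weights: one router matrix plus one small linear "expert" each.
W_router = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """Route one token vector to its top-k experts.

    Only the selected experts' weight matrices are read, which is why an
    MoE fetches far fewer parameters per token than a dense layer with
    the same total parameter count.
    """
    logits = x @ W_router
    idx = np.argsort(logits)[-top_k:]   # indices of the top-k experts
    gates = np.exp(logits[idx])
    gates /= gates.sum()                # softmax over the selected experts only
    return sum(g * (x @ experts[i]) for g, i in zip(gates, idx))

token = rng.standard_normal(d_model)
out = moe_layer(token)
print(out.shape)  # (8,)
```

With `top_k = 2` of four experts, each token reads half of the expert weights; on 3D AIMC hardware, where the weights are stationary in the crossbar arrays, the unselected experts' tiles can simply stay idle.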
Pages: 13-26 (22 pages)
Related papers
50 records
  • [1] Efficient large language model with analog in-memory computing
    Subramoney, Anand
    NATURE COMPUTATIONAL SCIENCE, 2025, 5 (01): 5 - 6
  • [2] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
    Du, Nan
    Huang, Yanping
    Dai, Andrew M.
    Tong, Simon
    Lepikhin, Dmitry
    Xu, Yuanzhong
    Krikun, Maxim
    Zhou, Yanqi
    Yu, Adams Wei
    Firat, Orhan
    Zoph, Barret
    Fedus, Liam
    Bosma, Maarten
    Zhou, Zongwei
    Wang, Tao
    Wang, Yu Emma
    Webster, Kellie
    Pellat, Marie
    Robinson, Kevin
    Meier-Hellstern, Kathleen
    Duke, Toju
    Dixon, Lucas
    Zhang, Kun
    Le, Quoc V.
    Wu, Yonghui
    Chen, Zhifeng
    Cui, Claire
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022
  • [3] 3D Ferrimagnetic Device for Multi-Bit Storage and Efficient In-Memory Computing
    Zhang, Zhizhong
    Zheng, Zhenyi
    Zhang, Yue
    Sun, Jinyi
    Lin, Kelian
    Zhang, Kun
    Feng, Xueqiang
    Chen, Lei
    Wang, Jinkai
    Wang, Guanda
    Du, Yinchang
    Zhang, Youguang
    Bournel, Arnaud
    Amiri, Pedram Khalili
    Zhao, Weisheng
    IEEE ELECTRON DEVICE LETTERS, 2021, 42 (02) : 152 - 155
  • [4] Scaling Vision-Language Models with Sparse Mixture of Experts
    Shen, Sheng
    Yao, Zhewei
    Li, Chunyuan
    Darrell, Trevor
    Keutzer, Kurt
    He, Yuxiong
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 11329 - 11344
  • [5] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
    Lu, Xudong
    Liu, Qi
    Xu, Yuhui
    Zhou, Aojun
    Huang, Siyuan
    Zhang, Bo
    Yan, Junchi
    Li, Hongsheng
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 6159 - 6172
  • [6] Ultrathin Nitride Ferroic Memory with Large ON/OFF Ratios for Analog In-memory Computing
    Wang, Ding
    Wang, Ping
    Mondal, Shubham
    Hu, Mingtao
    Wu, Yuanpeng
    Ma, Tao
    Mi, Zetian
    ADVANCED MATERIALS, 2023, 35 (20)
  • [7] A 3D Memristor Architecture for In-Memory Computing Demonstrated with SHA3
    Aljafar, Muayad J.
    Joshi, Rasika
    Acken, John M.
    INTERNATIONAL JOURNAL OF UNCONVENTIONAL COMPUTING, 2024, 19 (2-3) : 93 - 121
  • [8] CoMIC: Complementary Memristor based in-memory computing in 3D architecture
    Lalchhandama, F.
    Datta, Kamalika
    Chakraborty, Sandip
    Drechsler, Rolf
    Sengupta, Indranil
    JOURNAL OF SYSTEMS ARCHITECTURE, 2022, 126
  • [9] 3D AND-type NVM for In-Memory Computing of Artificial Intelligence
    Lue, Hang-Ting
    Wang, Keh-Chung
    Lu, Chih-Yuan
    2018 14TH IEEE INTERNATIONAL CONFERENCE ON SOLID-STATE AND INTEGRATED CIRCUIT TECHNOLOGY (ICSICT), 2018, : 717 - 718