MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity

Cited by: 7
Authors
Tu, Fengbin [1 ,2 ]
Wu, Zihan [1 ]
Wang, Yiqi [1 ]
Wu, Weiwei [1 ]
Liu, Leibo [1 ]
Hu, Yang [1 ]
Wei, Shaojun [1 ]
Yin, Shouyi [1 ]
Affiliations
[1] Tsinghua Univ, Sch Integrated Circuits, Beijing 100084, Peoples R China
[2] Hong Kong Univ Sci & Technol, Dept Elect & Comp Engn, Hong Kong, Peoples R China
Keywords
Computing-in-memory (CIM); dataflow; hybrid sparsity; multimodal transformers; reconfigurable architecture; EFFICIENT; PROCESSOR;
DOI
10.1109/JSSC.2023.3305663
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject Classification Codes
0808; 0809
Abstract
Multimodal Transformers are emerging artificial intelligence (AI) models that comprehend a mixture of signals from different modalities such as vision, natural language, and speech. The attention mechanism and massive matrix multiplications (MMs) cause high latency and energy consumption. Prior work has shown that a digital computing-in-memory (CIM) network can be an efficient architecture for processing Transformers while maintaining high accuracy. To further improve energy efficiency, the attention-token-bit hybrid sparsity in multimodal Transformers can be exploited. This hybrid sparsity significantly reduces computation, but its irregularity also harms CIM utilization. To fully exploit the attention-token-bit hybrid sparsity of multimodal Transformers, we design a digital CIM-based accelerator called MulTCIM with three corresponding features. The long reuse elimination dynamically reshapes the attention pattern to improve CIM utilization. The runtime token pruner (RTP) removes insignificant tokens, and the modal-adaptive CIM network (MACN) exploits symmetric modal overlapping to reduce CIM idleness. The effective bitwidth-balanced CIM (EBB-CIM) macro balances input bits across in-memory multiply-accumulations (MACs) to reduce computation time. The fabricated MulTCIM consumes only 2.24 μJ/token for the ViLBERT-base model, achieving 2.50x-5.91x lower energy than previous Transformer accelerators and digital CIM accelerators.
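The abstract describes the techniques only at a high level. As a rough, non-authoritative sketch of the runtime token pruning idea it mentions, the Python snippet below scores tokens by the total attention they receive and keeps only the top fraction. The function name prune_tokens, the keep_ratio parameter, and the attention-sum scoring rule are illustrative assumptions, not the paper's actual RTP algorithm.

```python
# Conceptual sketch of runtime token pruning (NOT the paper's RTP design).
# A token's importance is estimated by how much attention it receives from
# all other tokens; the least-attended tokens are dropped before later layers.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prune_tokens(tokens, q, k, keep_ratio=0.5):
    """tokens: (N, D) embeddings; q, k: (N, d) query/key projections."""
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))    # (N, N) attention map
    importance = attn.sum(axis=0)                     # attention received per token
    n_keep = max(1, int(keep_ratio * len(tokens)))
    keep = np.sort(np.argsort(importance)[-n_keep:])  # top tokens, original order
    return tokens[keep], keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, D, d = 16, 32, 8
    tokens = rng.standard_normal((N, D))
    q = rng.standard_normal((N, d))
    k = rng.standard_normal((N, d))
    pruned, kept_idx = prune_tokens(tokens, q, k, keep_ratio=0.5)
    print(pruned.shape, kept_idx)   # (8, 32) and the surviving token indices
```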
Pages: 90-101
Page count: 12