A Cascaded ReRAM-based Crossbar Architecture for Transformer Neural Network Acceleration

Cited by: 0
Authors
Xu, Jiahong [1 ]
Liu, Haikun [1 ]
Peng, Xiaoyang [1 ]
Duan, Zhuohui [1 ]
Liao, Xiaofei [1 ]
Jin, Hai [1 ]
Affiliations
[1] Huazhong University of Science and Technology, Wuhan, Hubei, China
Funding
National Natural Science Foundation of China
Keywords
analog-to-digital conversion; PIM; ReRAM; Transformer;
DOI
10.1145/3701034
Abstract
Emerging resistive random-access memory (ReRAM) based processing-in-memory (PIM) accelerators have been increasingly explored in recent years because they can efficiently perform in-situ matrix-vector multiplication (MVM) operations involved in a wide spectrum of artificial neural networks. However, significant challenges remain in applying existing ReRAM-based PIM accelerators to the widely used Transformer neural networks. Because Transformers involve a series of matrix-matrix multiplication (MatMul) operations with data dependencies, intermediate MatMul results must be written to ReRAM crossbar arrays for further processing. Conventional ReRAM-based PIM accelerators therefore often suffer from the high latency of ReRAM writes and intra-layer pipeline stalls.

In this paper, we propose ReCAT, a ReRAM-based PIM accelerator designed specifically for Transformers. ReCAT exploits transimpedance amplifiers (TIAs) to cascade a pair of crossbar arrays for the MatMul operations involved in the self-attention mechanism. The intermediate result of a MatMul generated by one crossbar array can be mapped directly onto another crossbar array, avoiding costly analog-to-digital conversions. In this way, ReCAT allows MVM operations to overlap with the corresponding data mapping, hiding the high latency of ReRAM writes. Furthermore, we propose an analog-to-digital converter (ADC) virtualization scheme that dynamically shares scarce ADCs among a group of crossbar arrays, significantly improving ADC utilization and eliminating the performance bottleneck of MVM operations. Experimental results show that ReCAT achieves average performance improvements of 207.3×, 2.11×, and 3.06× over other Transformer acceleration solutions, namely GPUs, ReBert, and ReTransformer, respectively. © 2024 Copyright held by the owner/author(s).
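The MatMul data dependency that motivates the cascaded crossbar design can be made concrete with a short sketch. The NumPy example below is not taken from the paper; it is standard single-head scaled dot-product attention with illustrative shapes, shown only to highlight that the output of the first MatMul (Q·K^T) is an operand of the second MatMul (with V), i.e., the intermediate result that a conventional ReRAM PIM would have to digitize and write back into a crossbar before proceeding.

# Minimal sketch (plain NumPy, no PIM modeling): the two chained MatMuls of
# self-attention. Shapes and names are illustrative assumptions.
import numpy as np

def self_attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # MatMul 1: depends only on Q and K
    # On a conventional ReRAM PIM, `scores` (an intermediate result) would be
    # converted by ADCs and written into another crossbar array at this point,
    # which is the high-latency ReRAM-write step the cascaded crossbars avoid.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                     # MatMul 2: consumes MatMul 1's output

# Toy usage: sequence length 4, model dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = self_attention(Q, K, V)
print(out.shape)   # (4, 8)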