A Cascaded ReRAM-based Crossbar Architecture for Transformer Neural Network Acceleration

Cited by: 0
Authors
Xu, Jiahong [1]
Liu, Haikun [1]
Peng, Xiaoyang [1]
Duan, Zhuohui [1]
Liao, Xiaofei [1]
Jin, Hai [1]
Affiliations
[1] Huazhong University of Science and Technology, Wuhan, Hubei, China
Funding
National Natural Science Foundation of China
Keywords
analog-to-digital conversion; PIM; ReRAM; Transformer
DOI
10.1145/3701034
Abstract
Emerging resistive random-access memory (ReRAM) based processing-in-memory (PIM) accelerators have been increasingly explored in recent years because they can efficiently perform in-situ matrix-vector multiplication (MVM) operations involved in a wide spectrum of artificial neural networks. However, significant challenges remain in applying existing ReRAM-based PIM accelerators to the widely used Transformer neural networks. Since Transformers involve a series of matrix-matrix multiplication (MatMul) operations with data dependencies, intermediate MatMul results must be written to ReRAM crossbar arrays for further processing. Conventional ReRAM-based PIM accelerators therefore often suffer from the high latency of ReRAM writes and from intra-layer pipeline stalls. In this paper, we propose ReCAT, a ReRAM-based PIM accelerator designed specifically for Transformers. ReCAT exploits transimpedance amplifiers (TIAs) to cascade a pair of crossbar arrays for the MatMul operations involved in the self-attention mechanism. The intermediate result of a MatMul generated by one crossbar array can be directly mapped to another crossbar array, avoiding costly analog-to-digital conversions. In this way, ReCAT allows MVM operations to overlap with the corresponding data mapping, hiding the high latency of ReRAM writes. Furthermore, we propose an analog-to-digital converter (ADC) virtualization scheme that dynamically shares scarce ADCs among a group of crossbar arrays, significantly improving ADC utilization and eliminating the performance bottleneck of MVM operations. Experimental results show that ReCAT achieves 207.3×, 2.11×, and 3.06× performance improvement on average compared with other Transformer acceleration solutions, namely GPUs, ReBert, and ReTransformer, respectively. © 2024 Copyright held by the owner/author(s).
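To make the data dependency mentioned in the abstract concrete, the sketch below shows standard scaled dot-product self-attention in NumPy. This is generic Transformer math, not code from the paper; the toy shapes, seed, and variable names are illustrative assumptions. The point is that the second MatMul consumes the output of the first, so on a ReRAM PIM accelerator that intermediate matrix must first be written (mapped) into a crossbar array, which is the high-latency step ReCAT hides by cascading crossbars through TIAs.

    import numpy as np

    def softmax(x):
        # Numerically stable row-wise softmax.
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def self_attention(Q, K, V):
        # First MatMul: attention scores computed from Q and K.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = softmax(scores)
        # Second MatMul: its left operand is the output of the first MatMul,
        # so a conventional ReRAM PIM design must write 'weights' into a
        # crossbar array (a slow ReRAM write) before this step can start.
        return weights @ V

    # Toy shapes: 8 tokens, head dimension 16 (illustrative values only).
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
    print(self_attention(Q, K, V).shape)  # (8, 16)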