RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models

Citations: 0
Authors
Jeon, Yunhyeong [1 ]
Jang, Minwoo [1 ]
Lee, Hwanjun [1 ]
Jung, Yeji [1 ]
Jung, Jin [2 ]
Lee, Jonggeon [2 ]
So, Jinin [2 ]
Kim, Daehoon [3 ]
Affiliations
[1] DGIST, Daegu 42988, South Korea
[2] Samsung Electronics, Hwaseong 443743, South Korea
[3] Yonsei Univ, Seoul 03722, South Korea
Keywords
Graphics processing units; Transformers; Random access memory; Kernel; Computer architecture; Natural language processing; Computational modeling; Vectors; Inverters; Encoding; Processing-in-memory; transformer model; rotary positional embedding
DOI
10.1109/LCA.2025.3535470
CLC number (Chinese Library Classification)
TP3 [Computing technology; computer technology]
Discipline code
0812
Abstract
The emergence of attention-based Transformer models, such as GPT, BERT, and LLaMA, has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A key driver of these improvements is the use of positional embeddings, which capture the contextual relationships between tokens in a sequence. However, existing positional embedding methods struggle to manage the performance overhead of long sequences and to effectively capture relationships between adjacent tokens. Rotary Positional Embedding (RoPE) addresses these challenges by embedding positional information with high accuracy and without requiring model retraining, even for long sequences. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference: we observe that RoPE accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce RoPIM, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. RoPIM employs a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-addition operations and minimizes operational dependencies via parallel data rearrangement. Additionally, RoPIM proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that RoPIM achieves up to a 307.9x performance improvement and 914.1x energy savings compared to conventional systems.
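The abstract describes RoPE only at a high level; for reference, below is a minimal NumPy sketch of the standard RoPE formulation from the literature (the interleaved-pair variant). This is not RoPIM's in-memory implementation, and the function name, signature, and pairing convention are illustrative assumptions. The pair-swapped multiply-add in the final two statements is the rearrangement-plus-multiply-addition pattern that, per the abstract, dominates GPU execution time and that RoPIM offloads to bank-level accelerators.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply Rotary Positional Embedding to x of shape (seq_len, head_dim).

    Each feature pair (x[2i], x[2i+1]) at token position m is rotated
    by the angle m * theta_i, where theta_i = base^(-2i / head_dim).
    """
    seq_len, dim = x.shape
    # Per-pair rotation frequencies theta_i.
    theta = base ** (-np.arange(0, dim, 2) / dim)        # (dim/2,)
    # Position-dependent angles m * theta_i.
    angles = np.outer(np.arange(seq_len), theta)         # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # 2-D rotation of each feature pair: a multiply-add over
    # pair-swapped elements, i.e., the data rearrangement and
    # dependency pattern the paper targets in memory.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Example: rotate the query matrix of a 16-token sequence, head_dim = 64.
q = np.random.randn(16, 64).astype(np.float32)
q_rot = rope(q)
```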
Pages: 41-44
Number of pages: 4