RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models

Cited: 0
Authors
Jeon, Yunhyeong [1 ]
Jang, Minwoo [1 ]
Lee, Hwanjun [1 ]
Jung, Yeji [1 ]
Jung, Jin [2 ]
Lee, Jonggeon [2 ]
So, Jinin [2 ]
Kim, Daehoon [3 ]
Affiliations
[1] DGIST, Daegu 42988, South Korea
[2] Samsung Electronics, Hwaseong 443743, South Korea
[3] Yonsei Univ, Seoul 03722, South Korea
Keywords
Graphics processing units; Transformers; Random access memory; Kernel; Computer architecture; Natural language processing; Computational modeling; Vectors; Inverters; Encoding; Processing-in-memory; transformer model; rotary positional embedding
DOI
10.1109/LCA.2025.3535470
Chinese Library Classification
TP3 [Computing technology; computer technology]
Discipline Classification Code
0812
Abstract
The emergence of attention-based Transformer models such as GPT, BERT, and LLaMA has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A key factor behind these improvements is positional embedding, which captures the contextual relationships between tokens in a sequence. However, existing positional embedding methods struggle with the performance overhead of long sequences and with effectively capturing relationships between adjacent tokens. Rotary Positional Embedding (RoPE) addresses these issues, embedding positional information accurately and, even for long sequences, without requiring model retraining. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference: we observe that it accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce RoPIM, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. RoPIM uses a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-addition operations and minimizes operational dependencies via parallel data rearrangement. Additionally, RoPIM proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that RoPIM achieves up to a 307.9x performance improvement and 914.1x energy savings compared to conventional systems.
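For context, RoPE rotates each two-dimensional pair of query/key features by an angle proportional to the token's position, and this rotation reduces to element-wise multiply-add operations on the input vector and a rearranged copy of it. The minimal NumPy sketch below shows only the standard RoPE formulation (Su et al.); the function name and the interleaved pairing convention are illustrative assumptions, and this is not RoPIM's in-memory implementation:

    import numpy as np

    def rope(x, base=10000.0):
        # x: (seq_len, d) array of query or key vectors, d even.
        # Applies the standard rotary positional embedding: feature
        # pair (2i, 2i+1) at position m is rotated by angle m * theta_i,
        # where theta_i = base^(-2i/d).
        seq_len, d = x.shape
        inv_freq = base ** (-np.arange(0, d, 2) / d)      # theta_i, shape (d/2,)
        angles = np.outer(np.arange(seq_len), inv_freq)   # m * theta_i, shape (seq_len, d/2)
        cos, sin = np.cos(angles), np.sin(angles)
        x_even, x_odd = x[:, 0::2], x[:, 1::2]
        out = np.empty_like(x)
        out[:, 0::2] = x_even * cos - x_odd * sin         # rotate each 2-D pair
        out[:, 1::2] = x_even * sin + x_odd * cos
        return out

    # Usage: rotate an 8-token sequence of 64-dimensional queries.
    q_rot = rope(np.random.randn(8, 64))

Note that the two output assignments are pure element-wise multiplies and adds over x and a sign-flipped rearrangement of x; performing exactly this multiply-add and rearrangement next to the DRAM banks is, per the abstract, how RoPIM avoids the off-chip data movement and dependency stalls that dominate RoPE's GPU execution time.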
Pages: 41-44
Number of pages: 4