Exploring Attention Sparsity to Accelerate Transformer Training on GPUs

Cited: 0
Authors
Yoon, Bokyeong [1 ]
Lee, Ah-Hyun [1 ]
Kim, Jinsung [2 ]
Moon, Gordon Euhyun [1 ]
Affiliations
[1] Sogang Univ, Dept Comp Sci & Engn, Seoul 04107, South Korea
[2] Chung Ang Univ, Sch Comp Sci & Engn, Seoul 06974, South Korea
Source
IEEE ACCESS, 2024, Vol. 12
Funding
National Research Foundation of Singapore
Keywords
Training; Sparse matrices; Transformers; Task analysis; Computational modeling; Text categorization; Computational complexity; Graphics processing units; Sparse Transformer; sparse attention; MHA optimization;
DOI
10.1109/ACCESS.2024.3425638
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
The computational complexity of training a Transformer model increases quadratically with the length of the input sequence. Therefore, to accelerate the training of a large-scale Transformer on long sequences, it is crucial to reduce the number of operations in the multi-head attention computations, which dominate the overall Transformer training process. Previous approaches have sought to sparsify the multi-head attention before training by statically selecting the critical elements of the attention score matrix. However, since the critical elements of the attention score matrix can vary across tasks and datasets, identifying them dynamically is essential for achieving better model quality. In this paper, we propose a new sparsity-aware Transformer that captures task- and input-dependent sparsity patterns in the attention score matrix during a small number of steps of standard Transformer training. Training then transitions from standard training to sparse training based on the degree of skewness and the distance values of the attention score matrices, and the identified sparsity pattern is exploited during the sparse training phase. Experimental results demonstrate that our approach significantly reduces the number of operations required for multi-head attention, achieving up to a 2.84x training speedup, a 6.87x memory reduction, and better accuracy compared to state-of-the-art sparse Transformer models.
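The warm-up-then-sparsify idea described in the abstract can be illustrated with a minimal PyTorch sketch (not the authors' implementation): attention scores gathered during a few dense warm-up steps are summarized by their per-row skewness, sufficiently skewed rows keep only their top-scoring keys, and the resulting mask is reused for masked attention in later sparse training steps. The function names and the hyperparameters skewness_threshold and keep_ratio are illustrative assumptions; the paper's actual criterion also uses distance values, which are omitted here.

import torch


def attention_scores(q, k):
    # Scaled dot-product attention probabilities, shape (batch, heads, seq, seq).
    d_k = q.size(-1)
    return torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)


def row_skewness(scores):
    # Sample skewness of each attention row; a high value means a few keys dominate.
    mean = scores.mean(dim=-1, keepdim=True)
    std = scores.std(dim=-1, keepdim=True).clamp_min(1e-8)
    return (((scores - mean) / std) ** 3).mean(dim=-1)


def build_sparse_mask(scores, skewness_threshold=1.0, keep_ratio=0.25):
    # Rows skewed enough to be safely sparsified keep only their top keys;
    # all other rows remain dense. Thresholds here are assumed, not from the paper.
    seq_len = scores.size(-1)
    k = max(1, int(keep_ratio * seq_len))
    top_idx = scores.topk(k, dim=-1).indices
    sparse = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, top_idx, True)
    skewed = row_skewness(scores) > skewness_threshold        # (batch, heads, seq)
    return torch.where(skewed.unsqueeze(-1), sparse, torch.ones_like(sparse))


def masked_attention(q, k, v, mask):
    # Attention restricted to the key positions allowed by the mask.
    d_k = q.size(-1)
    logits = (q @ k.transpose(-2, -1) / d_k ** 0.5).masked_fill(~mask, float("-inf"))
    return torch.softmax(logits, dim=-1) @ v


if __name__ == "__main__":
    b, h, s, d = 2, 4, 128, 64
    q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
    mask = build_sparse_mask(attention_scores(q, k))  # captured in a dense warm-up step
    out = masked_attention(q, k, v, mask)             # reused during sparse training
    print(out.shape, "kept fraction:", mask.float().mean().item())

In the paper's actual pipeline the captured pattern would drive sparse GPU attention kernels; the dense masked computation above only illustrates the selection logic.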
Pages: 131373-131384
Page count: 12
Related Papers
50 records in total
  • [31] Taming Unstructured Sparsity on GPUs via Latency-Aware Optimization
    Zhu, Maohua
    Xie, Yuan
    PROCEEDINGS OF THE 2020 57TH ACM/EDAC/IEEE DESIGN AUTOMATION CONFERENCE (DAC), 2020,
  • [32] Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer
    Ilinykh, Nikolai
    Dobnik, Simon
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 4062 - 4073
  • [33] Exploring Memory Persistency Models for GPUs
    Lin, Zhen
    Alshboul, Mohammad
    Solihin, Yan
    Zhou, Huiyang
    2019 28TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT 2019), 2019, : 310 - 322
  • [34] Image Deraining Transformer with Sparsity and Frequency Guidance
    Song, Tianyu
    Li, Pengpeng
    Jin, Guiyue
    Jin, Jiyu
    Fan, Shumin
    Chen, Xiang
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1889 - 1894
  • [35] Sparsity-Aware Caches to Accelerate Deep Neural Networks
    Ganesan, Vinod
    Sen, Sanchari
    Kumar, Pratyush
    Gala, Neel
    Veezhinathan, Kamakoti
    Raghunathan, Anand
    PROCEEDINGS OF THE 2020 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE 2020), 2020, : 85 - 90
  • [36] POSTER: Exploiting the Input Sparsity to Accelerate Deep Neural Networks
    Dong, Xiao
    Liu, Lei
    Li, Guangli
    Li, Jiansong
    Zhao, Peng
    Wang, Xueying
    Feng, Xiaobing
    PROCEEDINGS OF THE 24TH SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING (PPOPP '19), 2019, : 401 - 402
  • [37] MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity
    Tu, Fengbin
    Wu, Zihan
    Wang, Yiqi
    Wu, Weiwei
    Liu, Leibo
    Hu, Yang
    Wei, Shaojun
    Yin, Shouyi
    IEEE JOURNAL OF SOLID-STATE CIRCUITS, 2024, 59 (01) : 90 - 101
  • [38] Enhancing 3D Visual Grounding with Deformable Attention Transformer and Geometry Affine Transformation: Overcoming sparsity challenges
    Zhang, Can
    Da, Feipeng
    Gai, Shaoyan
    DISPLAYS, 2025, 87
  • [39] Using Quadruple Precision Arithmetic to Accelerate Krylov Subspace Methods on GPUs
    Mukunoki, Daichi
    Takahashi, Daisuke
    PARALLEL PROCESSING AND APPLIED MATHEMATICS (PPAM 2013), PT I, 2014, 8384 : 632 - 642
  • [40] FlexGM: An Adaptive Runtime System to Accelerate Graph Matching Networks on GPUs
    Dai, Yue
    Tang, Xulong
    Zhang, Youtao
    2023 IEEE 41ST INTERNATIONAL CONFERENCE ON COMPUTER DESIGN, ICCD, 2023, : 348 - 356