Exploring Attention Sparsity to Accelerate Transformer Training on GPUs

Cited by: 0
Authors
Yoon, Bokyeong [1 ]
Lee, Ah-Hyun [1 ]
Kim, Jinsung [2 ]
Moon, Gordon Euhyun [1 ]
Affiliations
[1] Sogang Univ, Dept Comp Sci & Engn, Seoul 04107, South Korea
[2] Chung Ang Univ, Sch Comp Sci & Engn, Seoul 06974, South Korea
Source
IEEE ACCESS | 2024, Vol. 12
Funding
National Research Foundation of Singapore
Keywords
Training; Sparse matrices; Transformers; Task analysis; Computational modeling; Text categorization; Computational complexity; Graphics processing units; Sparse Transformer; sparse attention; MHA optimization;
DOI
10.1109/ACCESS.2024.3425638
Chinese Library Classification (CLC)
TP [automation technology; computer technology]
Discipline classification code
0812
Abstract
The computational complexity of training a Transformer model increases quadratically with the length of the input sequence. Therefore, to accelerate the training of a large-scale Transformer on long sequences, it is crucial to reduce the number of operations in the multi-head attention computations, which dominate the overall Transformer training process. Previous approaches have sought to sparsify the multi-head attention before training by statically selecting the critical elements in the attention score matrix. However, since the critical elements of the attention score matrix vary across tasks and datasets, identifying them dynamically is essential for achieving better model quality. In this paper, we propose a new sparsity-aware Transformer that captures a task- and input-dependent sparsity pattern in the attention score matrix during a small number of standard training steps. The identified sparsity pattern is then used during sparse training, to which the model transitions from standard training based on the skewness and distance values of the attention score matrices. Experimental results demonstrate that our approach significantly reduces the number of multi-head attention operations, achieving up to a 2.84x training speedup and a 6.87x memory reduction with better accuracy than state-of-the-art sparse Transformer models.
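The workflow described in the abstract (a brief dense warm-up, detection of a sufficiently skewed attention-score distribution, and reuse of the captured sparsity pattern in subsequent sparse training steps) can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the row-wise skewness statistic, the top-k mask construction, the threshold of 1.0, the budget of 16 kept scores per row, and all function and variable names below are illustrative assumptions.

import torch

def row_skewness(scores):
    # Sample skewness of each row of the attention score matrix;
    # large positive values indicate that a few scores dominate the row.
    mean = scores.mean(dim=-1, keepdim=True)
    std = scores.std(dim=-1, keepdim=True) + 1e-8
    return (((scores - mean) / std) ** 3).mean(dim=-1)

def topk_mask(scores, keep):
    # Boolean mask retaining only the `keep` largest scores in each row
    # (an assumed selection rule for the captured sparsity pattern).
    idx = scores.topk(keep, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; positions excluded by `mask` are set
    # to -inf before the softmax so they receive zero attention weight.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v, scores

# Toy usage: run a standard (dense) step, and once the score distribution
# is skewed enough, freeze a sparse pattern for later training steps.
q = torch.randn(1, 8, 128, 64)                # (batch, heads, seq_len, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)
out, scores = attention(q, k, v)              # dense warm-up step
if row_skewness(scores).mean() > 1.0:         # assumed skewness threshold
    frozen_mask = topk_mask(scores, keep=16)  # assumed per-row budget
    out, _ = attention(q, k, v, mask=frozen_mask)  # sparse training step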
Pages: 131373-131384 (12 pages)