Exploring Attention Sparsity to Accelerate Transformer Training on GPUs

Cited by: 0
Authors
Yoon, Bokyeong [1 ]
Lee, Ah-Hyun [1 ]
Kim, Jinsung [2 ]
Moon, Gordon Euhyun [1 ]
Affiliations
[1] Sogang Univ, Dept Comp Sci & Engn, Seoul 04107, South Korea
[2] Chung Ang Univ, Sch Comp Sci & Engn, Seoul 06974, South Korea
Source
IEEE ACCESS | 2024, Vol. 12
Funding
National Research Foundation of Singapore
Keywords
Training; Sparse matrices; Transformers; Task analysis; Computational modeling; Text categorization; Computational complexity; Graphics processing units; Sparse Transformer; sparse attention; MHA optimization;
DOI
10.1109/ACCESS.2024.3425638
Chinese Library Classification (CLC)
TP [automation technology; computer technology]
Discipline classification code
0812
Abstract
The computational complexity of training a Transformer model increases quadratically with the length of the input sequence. Therefore, to accelerate the training of a large-scale Transformer on long sequences, it is crucial to reduce the number of operations in the multi-head attention computations, which dominate the overall Transformer training process. Previous approaches have sought to sparsify the multi-head attention before training by statically selecting the critical elements in the attention score matrix. However, since the critical elements of the attention score matrix vary across tasks and datasets, identifying them dynamically is essential for achieving better model quality. In this paper, we propose a new sparsity-aware Transformer that captures a task- and input-dependent sparsity pattern in the attention score matrix during a small number of standard training steps. The identified sparsity pattern is then used during sparse training, to which the model transitions from standard training based on the skewness and distance values of the attention score matrices. Experimental results demonstrate that our approach significantly reduces the number of multi-head attention operations, achieving up to a 2.84x training speedup and a 6.87x memory reduction with better accuracy than state-of-the-art sparse Transformer models.
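The workflow described in the abstract (a brief dense warm-up, detection of a sufficiently skewed attention-score distribution, and reuse of the captured sparsity pattern in subsequent sparse training steps) can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the row-wise skewness statistic, the top-k mask construction, the threshold of 1.0, the budget of 16 kept scores per row, and all function and variable names below are illustrative assumptions.

import torch

def row_skewness(scores):
    # Sample skewness of each row of the attention score matrix;
    # large positive values indicate that a few scores dominate the row.
    mean = scores.mean(dim=-1, keepdim=True)
    std = scores.std(dim=-1, keepdim=True) + 1e-8
    return (((scores - mean) / std) ** 3).mean(dim=-1)

def topk_mask(scores, keep):
    # Boolean mask retaining only the `keep` largest scores in each row
    # (an assumed selection rule for the captured sparsity pattern).
    idx = scores.topk(keep, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; positions excluded by `mask` are set
    # to -inf before the softmax so they receive zero attention weight.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v, scores

# Toy usage: run a standard (dense) step, and once the score distribution
# is skewed enough, freeze a sparse pattern for later training steps.
q = torch.randn(1, 8, 128, 64)                # (batch, heads, seq_len, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)
out, scores = attention(q, k, v)              # dense warm-up step
if row_skewness(scores).mean() > 1.0:         # assumed skewness threshold
    frozen_mask = topk_mask(scores, keep=16)  # assumed per-row budget
    out, _ = attention(q, k, v, mask=frozen_mask)  # sparse training step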
Pages: 131373-131384 (12 pages)