Exploring Attention Sparsity to Accelerate Transformer Training on GPUs

Cited: 0
Authors
Yoon, Bokyeong [1 ]
Lee, Ah-Hyun [1 ]
Kim, Jinsung [2 ]
Moon, Gordon Euhyun [1 ]
Affiliations
[1] Sogang Univ, Dept Comp Sci & Engn, Seoul 04107, South Korea
[2] Chung Ang Univ, Sch Comp Sci & Engn, Seoul 06974, South Korea
Source
IEEE ACCESS, 2024, Vol. 12
Funding
National Research Foundation of Singapore
Keywords
Training; Sparse matrices; Transformers; Task analysis; Computational modeling; Text categorization; Computational complexity; Graphics processing units; Sparse Transformer; sparse attention; MHA optimization;
DOI
10.1109/ACCESS.2024.3425638
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
The computational complexity of training a Transformer model increases quadratically with the length of the input sequence. Therefore, to accelerate the training of a large-scale Transformer on long sequences, it is crucial to reduce the number of operations in the multi-head attention computations, which dominate the overall Transformer training process. Previous approaches have sought to sparsify the multi-head attention before training by statically selecting the critical elements of the attention score matrix. However, since the critical elements of the attention score matrix can vary across tasks and datasets, identifying them dynamically is essential for achieving better model quality. In this paper, we propose a new sparsity-aware Transformer that captures task- and input-dependent sparsity patterns in the attention score matrix during a small number of steps of standard Transformer training. Training then transitions from standard training to sparse training based on the degree of skewness and the distance values of the attention score matrices, and the identified sparsity pattern is exploited during the sparse training phase. Experimental results demonstrate that our approach significantly reduces the number of operations required for multi-head attention, achieving up to a 2.84x training speedup, a 6.87x memory reduction, and better accuracy compared to state-of-the-art sparse Transformer models.
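The warm-up-then-sparsify idea described in the abstract can be illustrated with a minimal PyTorch sketch (not the authors' implementation): attention scores gathered during a few dense warm-up steps are summarized by their per-row skewness, sufficiently skewed rows keep only their top-scoring keys, and the resulting mask is reused for masked attention in later sparse training steps. The function names and the hyperparameters skewness_threshold and keep_ratio are illustrative assumptions; the paper's actual criterion also uses distance values, which are omitted here.

import torch


def attention_scores(q, k):
    # Scaled dot-product attention probabilities, shape (batch, heads, seq, seq).
    d_k = q.size(-1)
    return torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)


def row_skewness(scores):
    # Sample skewness of each attention row; a high value means a few keys dominate.
    mean = scores.mean(dim=-1, keepdim=True)
    std = scores.std(dim=-1, keepdim=True).clamp_min(1e-8)
    return (((scores - mean) / std) ** 3).mean(dim=-1)


def build_sparse_mask(scores, skewness_threshold=1.0, keep_ratio=0.25):
    # Rows skewed enough to be safely sparsified keep only their top keys;
    # all other rows remain dense. Thresholds here are assumed, not from the paper.
    seq_len = scores.size(-1)
    k = max(1, int(keep_ratio * seq_len))
    top_idx = scores.topk(k, dim=-1).indices
    sparse = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, top_idx, True)
    skewed = row_skewness(scores) > skewness_threshold        # (batch, heads, seq)
    return torch.where(skewed.unsqueeze(-1), sparse, torch.ones_like(sparse))


def masked_attention(q, k, v, mask):
    # Attention restricted to the key positions allowed by the mask.
    d_k = q.size(-1)
    logits = (q @ k.transpose(-2, -1) / d_k ** 0.5).masked_fill(~mask, float("-inf"))
    return torch.softmax(logits, dim=-1) @ v


if __name__ == "__main__":
    b, h, s, d = 2, 4, 128, 64
    q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
    mask = build_sparse_mask(attention_scores(q, k))  # captured in a dense warm-up step
    out = masked_attention(q, k, v, mask)             # reused during sparse training
    print(out.shape, "kept fraction:", mask.float().mean().item())

In the paper's actual pipeline the captured pattern would drive sparse GPU attention kernels; the dense masked computation above only illustrates the selection logic.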
Pages: 131373-131384
Page count: 12
Related Papers
50 records in total
  • [31] Taming Unstructured Sparsity on GPUs via Latency-Aware Optimization
    Zhu, Maohua
    Xie, Yuan
    PROCEEDINGS OF THE 2020 57TH ACM/EDAC/IEEE DESIGN AUTOMATION CONFERENCE (DAC), 2020,
  • [32] Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer
    Ilinykh, Nikolai
    Dobnik, Simon
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 4062 - 4073
  • [33] Exploring Memory Persistency Models for GPUs
    Lin, Zhen
    Alshboul, Mohammad
    Solihin, Yan
    Zhou, Huiyang
    2019 28TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT 2019), 2019, : 310 - 322
  • [34] Image Deraining Transformer with Sparsity and Frequency Guidance
    Song, Tianyu
    Li, Pengpeng
    Jin, Guiyue
    Jin, Jiyu
    Fan, Shumin
    Chen, Xiang
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1889 - 1894
  • [35] Sparsity-Aware Caches to Accelerate Deep Neural Networks
    Ganesan, Vinod
    Sen, Sanchari
    Kumar, Pratyush
    Gala, Neel
    Veezhinathan, Kamakoti
    Raghunathan, Anand
    PROCEEDINGS OF THE 2020 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE 2020), 2020, : 85 - 90
  • [36] POSTER: Exploiting the Input Sparsity to Accelerate Deep Neural Networks
    Dong, Xiao
    Liu, Lei
    Li, Guangli
    Li, Jiansong
    Zhao, Peng
    Wang, Xueying
    Feng, Xiaobing
    PROCEEDINGS OF THE 24TH SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING (PPOPP '19), 2019, : 401 - 402
  • [37] MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity
    Tu, Fengbin
    Wu, Zihan
    Wang, Yiqi
    Wu, Weiwei
    Liu, Leibo
    Hu, Yang
    Wei, Shaojun
    Yin, Shouyi
    IEEE JOURNAL OF SOLID-STATE CIRCUITS, 2024, 59 (01) : 90 - 101
  • [38] Enhancing 3D Visual Grounding with Deformable Attention Transformer and Geometry Affine Transformation: Overcoming sparsity challenges
    Zhang, Can
    Da, Feipeng
    Gai, Shaoyan
    DISPLAYS, 2025, 87
  • [39] Using Quadruple Precision Arithmetic to Accelerate Krylov Subspace Methods on GPUs
    Mukunoki, Daichi
    Takahashi, Daisuke
    PARALLEL PROCESSING AND APPLIED MATHEMATICS (PPAM 2013), PT I, 2014, 8384 : 632 - 642
  • [40] FlexGM: An Adaptive Runtime System to Accelerate Graph Matching Networks on GPUs
    Dai, Yue
    Tang, Xulong
    Zhang, Youtao
    2023 IEEE 41ST INTERNATIONAL CONFERENCE ON COMPUTER DESIGN, ICCD, 2023, : 348 - 356