Efficient Transformer Inference with Statically Structured Sparse Attention

Cited by: 3
Authors
Dai, Steve [1 ]
Genc, Hasan [2 ]
Venkatesan, Rangharajan [1 ]
Khailany, Brucek [1 ]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Univ Calif Berkeley, Berkeley, CA 94720 USA
Source
2023 60th ACM/IEEE Design Automation Conference (DAC) | 2023
DOI
10.1109/DAC56929.2023.10247993
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Self-attention matrices of Transformers are often highly sparse because the relevant context of each token is typically limited to just a few other tokens in the sequence. To reduce the computational burden of self-attention during Transformer inference, we propose static, structured, sparse attention masks that split attention matrices into dense regions, skipping computation outside these regions and reducing computation inside them. To support the proposed mask structure, we design an entropy-aware finetuning algorithm that naturally encourages attention sparsity while maximizing task accuracy. Furthermore, we extend a typical dense deep learning accelerator to efficiently exploit our structured sparsity pattern. Compared to a dense baseline, we achieve a 56.6% reduction in energy consumption and a 58.9% performance improvement, with less than 1% accuracy loss and 2.6% area overhead.
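The abstract describes two ingredients: a static, structured sparse attention mask made of dense regions, and an entropy-aware finetuning objective that encourages sparse attention. The sketch below is only a minimal illustration of these ideas, not the paper's implementation: it assumes the dense regions can be modeled as fixed-size blocks along the diagonal, it models skipping only the computation outside the mask (not the additional reduction inside the regions), and the names block_diagonal_mask, block_size, and the 0.01 entropy weight are hypothetical choices for illustration.

```python
# Minimal sketch (PyTorch). Assumptions, not from the paper: dense regions are
# fixed-size diagonal blocks, and entropy-aware finetuning is approximated by
# adding the mean row-wise attention entropy to the task loss.

import math
import torch


def block_diagonal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Static structured mask: True inside dense blocks, False elsewhere."""
    blocks = torch.arange(seq_len) // block_size          # block id of each token
    return blocks.unsqueeze(0) == blocks.unsqueeze(1)     # [seq_len, seq_len] bool


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the dense regions of `mask`.

    q, k, v: [batch, heads, seq_len, head_dim]; mask: [seq_len, seq_len] bool.
    Positions outside the mask are set to -inf before the softmax; a real
    accelerator would skip that work entirely instead of masking it out.
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores.masked_fill(~mask, float("-inf"))
    attn = scores.softmax(dim=-1)                          # exactly zero outside the blocks
    return attn @ v, attn


def attention_entropy(attn, eps: float = 1e-9):
    """Mean row-wise entropy of the attention matrix (lower = sparser rows)."""
    return -(attn * (attn + eps).log()).sum(dim=-1).mean()


if __name__ == "__main__":
    batch, heads, seq_len, head_dim = 2, 4, 16, 8
    q = torch.randn(batch, heads, seq_len, head_dim)
    k = torch.randn(batch, heads, seq_len, head_dim)
    v = torch.randn(batch, heads, seq_len, head_dim)

    mask = block_diagonal_mask(seq_len, block_size=4)
    out, attn = masked_attention(q, k, v, mask)

    # Hypothetical finetuning objective: a stand-in task loss plus an entropy
    # penalty that concentrates attention mass on a few tokens, so restricting
    # attention to the static dense regions loses little accuracy.
    task_loss = out.pow(2).mean()
    loss = task_loss + 0.01 * attention_entropy(attn)
    print(out.shape, float(loss))
```

In this toy setting the mask only zeroes out scores after they are computed; the point of the paper's structured pattern is that a suitably extended accelerator can avoid computing and moving the masked-out regions in the first place.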
Pages: 6