CTA: Hardware-Software Co-design for Compressed Token Attention Mechanism

Cited by: 7
Authors
Wang, Haoran [1,2]
Xu, Haobo [1]
Wang, Ying [1,2,3]
Han, Yinhe [1,2,3]
Affiliations
[1] Chinese Acad Sci, CICS, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Zhejiang Lab, Hangzhou, Peoples R China
Source
2023 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, HPCA | 2023
Funding
National Natural Science Foundation of China;
Keywords
PRODUCT QUANTIZATION; EFFICIENT;
DOI
10.1109/HPCA56546.2023.10070997
CLC Classification Number
TP3 [Computing Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
The attention mechanism is becoming an integral part of modern neural networks, bringing breakthroughs to Natural Language Processing (NLP) applications and even Computer Vision (CV) applications. Unfortunately, the superiority of the attention mechanism comes from its ability to model relations between any two positions in a long sequence, which incurs high inference overhead. For state-of-the-art AI workloads such as BERT or GPT-2, the attention mechanism is reported to account for up to 50% of the inference overhead. Previous works seek to alleviate this performance bottleneck by removing useless relations for each position and accelerating position-specific operations. However, their attempts require selecting from a sequence of relations once per position, which is essentially frequent on-the-fly pruning and breaks the inherent parallelism of the attention mechanism. In this paper, we propose CTA, an algorithm-architecture co-designed solution that substantially reduces the theoretical complexity of the attention mechanism, enabling significant speedup and energy savings. Inspired by the fact that the feature sequences encoded by the attention mechanism contain a large amount of repeated semantic features, we propose a novel approximation scheme that efficiently removes this repetition, calculating attention only among the necessary features and thus reducing computation complexity quadratically. To exploit this algorithmic bonus and enable high-performance attention inference, we devise a specialized architecture that efficiently supports the proposed approximation scheme. Extensive experiments show that, on average, CTA achieves 27.7x speedup and 634.0x energy savings with no accuracy loss, and 44.2x speedup and 950.0x energy savings with around 1% accuracy loss, over an Nvidia V100-SXM2 GPU. CTA also achieves 22.8x speedup and 479.6x energy savings over an ELSA accelerator+GPU system.
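As a reading aid, the core idea of the approximation scheme can be sketched in a few lines of PyTorch: tokens whose feature vectors are near-duplicates share a single representative, attention is computed only among the m representatives (O(m^2) rather than O(n^2) score computations), and each representative key is up-weighted by its group size so that, for exact duplicates, the compressed softmax matches the uncompressed one. This is an illustrative sketch of the general idea, not the authors' CTA algorithm; the cosine-similarity threshold, the greedy grouping loop, and the function name compressed_token_attention are all assumptions made for illustration.

import torch
import torch.nn.functional as F

def compressed_token_attention(x, w_q, w_k, w_v, sim_threshold=0.98):
    """x: (n, d) token features; w_q, w_k, w_v: (d, d) projection matrices.
    Illustrative sketch only: self-attention over deduplicated token features."""
    n, d = x.shape
    x_norm = F.normalize(x, dim=-1)
    group = torch.full((n,), -1, dtype=torch.long)  # group[i] = slot of i's representative
    reps = []                                       # token indices chosen as representatives
    for i in range(n):                              # greedy near-duplicate grouping (assumed)
        for j, r in enumerate(reps):
            if torch.dot(x_norm[i], x_norm[r]) >= sim_threshold:
                group[i] = j
                break
        if group[i] == -1:                          # no close representative found
            group[i] = len(reps)
            reps.append(i)
    reps_t = torch.tensor(reps)
    counts = torch.bincount(group, minlength=len(reps)).float()
    q = x[reps_t] @ w_q                             # (m, d): project representatives only
    k = x[reps_t] @ w_k
    v = x[reps_t] @ w_v
    # A key duplicated c times contributes c * exp(logit) to the softmax, so
    # adding log(c) to its logit reproduces full attention for exact duplicates.
    logits = q @ k.T / d ** 0.5 + counts.log()
    out_reps = torch.softmax(logits, dim=-1) @ v    # (m, d) outputs for representatives
    return out_reps[group]                          # broadcast back to all n tokens

# Example: 128 tokens drawn from only 16 distinct feature vectors, so the
# quadratic score computation runs over at most 16 representatives, not 128 tokens.
x = torch.randn(16, 64)[torch.randint(0, 16, (128,))]
w = [torch.randn(64, 64) * 64 ** -0.5 for _ in range(3)]
out = compressed_token_attention(x, *w)             # shape (128, 64)

For near-duplicate rather than exact-duplicate features this is an approximation, which matches the abstract's two operating points of "no accuracy loss" and "around 1% accuracy loss"; the irregular grouping and gather/scatter steps in such a scheme are exactly the kind of operations that motivate a specialized architecture rather than a GPU.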
Pages: 429-441
Number of pages: 13