ToEx: Accelerating Generation Stage of Transformer-Based Language Models via Token-Adaptive Early Exit

Citations: 0
Authors
Kang, Myeonggu [1 ]
Park, Junyoung [1 ]
Shin, Hyein [1 ]
Shin, Jaekang [1 ]
Kim, Lee-Sup [1 ]
Affiliations
[1] Korea Adv Inst Sci & Technol KAIST, Sch Elect Engn, Daejeon 34141, South Korea
Keywords
Decoding; Transformers; Computers; Vectors; Computational modeling; Hardware; Transformer-based language model; early exit; deep learning; natural language processing;
DOI
10.1109/TC.2024.3404051
CLC Number
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Transformer-based language models have recently gained popularity in numerous natural language processing (NLP) applications due to their superior performance compared to traditional algorithms. These models involve two execution stages: summarization and generation. The generation stage accounts for a significant portion of the total execution time because of its auto-regressive property, which necessitates considerable and repetitive off-chip accesses. Consequently, our objective is to minimize off-chip accesses during the generation stage to expedite transformer execution. To achieve this goal, we propose a token-adaptive early exit (ToEx) that generates output tokens using fewer decoders, thereby reducing the off-chip accesses needed to load weight parameters. Although this approach can minimize data communication, it brings two challenges: 1) inaccurate self-attention computation, and 2) significant overhead for the exit decision. To overcome these challenges, we introduce a methodology that preserves accurate self-attention by lazily performing computations for previously exited tokens. Moreover, we mitigate the overhead of the exit decision by incorporating a lightweight output embedding layer. We also present a hardware design that efficiently supports the proposed work. Evaluation results demonstrate that our work reduces the number of executed decoders by 2.6× on average, achieving an average 3.2× speedup over transformer execution without our work.
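The abstract's core idea can be illustrated with a minimal sketch. This is not the authors' implementation: the layer count, the `tanh` stand-in for a decoder block, and the confidence threshold are all illustrative assumptions. It shows the token-adaptive early-exit loop: each generated token runs decoder layers one by one, a lightweight output-embedding (LM head) scores exit confidence after each layer, and once the threshold is met the remaining layers' weights are never loaded, which is where the off-chip-access savings come from.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LAYERS = 6   # illustrative decoder depth
HIDDEN = 8       # illustrative hidden/vocab size
THRESHOLD = 0.5  # hypothetical exit-confidence threshold

# Toy per-layer weight matrices standing in for full decoder blocks.
layers = [rng.standard_normal((HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)
          for _ in range(NUM_LAYERS)]
# Lightweight output-embedding (LM head) reused for the exit decision,
# mirroring the paper's low-overhead exit check.
lm_head = rng.standard_normal((HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def generate_token(h, threshold=THRESHOLD):
    """Run decoder layers until the exit criterion fires.

    Returns (token_id, layers_used). Layers deeper than the exit
    point are never evaluated, so their weights need not be fetched.
    Hidden states of exited tokens would later be computed lazily
    when deeper layers' self-attention actually needs them (not
    modeled in this toy sketch).
    """
    for depth, w in enumerate(layers, start=1):
        h = np.tanh(h @ w)                  # stand-in for one decoder block
        logits = h @ lm_head
        conf = softmax(logits).max()        # exit decision via the LM head
        if conf >= threshold or depth == NUM_LAYERS:
            return int(np.argmax(logits)), depth


# Generate a few tokens from random hidden states and observe
# that different tokens exit at different depths.
depths = [generate_token(rng.standard_normal(HIDDEN))[1] for _ in range(20)]
print("layers used per token:", depths)
print("average depth:", sum(depths) / len(depths))
```

With a token-adaptive policy like this, "easy" tokens exit shallow and "hard" tokens run the full stack, so the average depth (and hence weight traffic) drops without a fixed truncation of the model.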
Pages: 2248-2261
Page count: 14