Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Cited by: 0
Authors
Zhao, Chendong [1 ,2 ]
Wang, Jianzong [1 ]
Wei, Wenqi [1 ]
Qu, Xiaoyang [1 ]
Wang, Haoqian [2 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China
[2] Tsinghua Univ, Shenzhen Int Grad Sch, Beijing, Peoples R China
Source
2022 IEEE 9TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA) | 2022
Keywords
Automatic Speech Recognition; Sparse Attention; Monotonic Attention; Self-Attention;
DOI
10.1109/DSAA54385.2022.10032360
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Transformer architecture, based on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be easily applied to streaming or online ASR. For self-attention in Transformer ASR, the softmax normalization used in the attention mechanism makes it hard to highlight the most important speech information, since every frame receives a non-zero weight. For multi-head attention in Transformer ASR, it is difficult to model monotonic alignments across different heads. To overcome these two limitations, we integrate sparse attention and monotonic attention into Transformer-based ASR. The sparse mechanism introduces a learned sparsity scheme so that each self-attention distribution better fits its corresponding head. The monotonic attention applies a regularization term that prunes redundant heads in the multi-head attention structure. Experiments show that our method effectively improves the attention mechanism on widely used speech recognition benchmarks.
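The abstract does not spell out the exact form of the learned sparsity scheme, so the sketch below is only a rough illustration: it replaces the softmax in scaled dot-product attention with sparsemax (Martins & Astudillo, 2016), the simplest sparse normalization, which can assign exactly zero weight to irrelevant frames. All function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def sparsemax(scores: np.ndarray) -> np.ndarray:
    """Sparsemax: a sparse alternative to softmax that projects scores onto the
    probability simplex and can give exactly zero weight to some positions."""
    z = np.sort(scores)[::-1]               # sort scores in descending order
    cumsum = np.cumsum(z)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z > cumsum            # positions kept in the support set
    k_z = k[support][-1]                    # size of the support set
    tau = (cumsum[support][-1] - 1) / k_z   # threshold subtracted from all scores
    return np.maximum(scores - tau, 0.0)

def sparse_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Single-query scaled dot-product attention with sparsemax weights."""
    d = q.shape[-1]
    weights = sparsemax(K @ q / np.sqrt(d))  # sparse attention distribution over keys
    return weights @ V

# Toy usage: several keys typically receive exactly zero attention weight.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=(6, 8))
print(sparsemax(K @ q / np.sqrt(4)))
```

In the adaptive scheme described by the paper, the degree of sparsity would be learned per head rather than fixed as above; this snippet only shows the basic sparse replacement for softmax.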
Pages: 173-180
Number of pages: 8