GAUSSIAN KERNELIZED SELF-ATTENTION FOR LONG SEQUENCE DATA AND ITS APPLICATION TO CTC-BASED SPEECH RECOGNITION

Cited by: 5
Authors
Kashiwagi, Yosuke [1 ]
Tsunoo, Emiru [1 ]
Watanabe, Shinji [2 ]
Affiliations
[1] Sony Corp, Tokyo, Japan
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
speech recognition; end-to-end; self-attention; long sequence data
D O I
10.1109/ICASSP39728.2021.9413493
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Self-attention (SA) based models have recently achieved significant performance improvements in hybrid and end-to-end automatic speech recognition (ASR) systems owing to their flexible context modeling capability. However, it is also known that accuracy degrades when SA is applied to long sequence data. This is mainly due to the length mismatch between inference and training data, because training data are usually divided into short segments for efficient training. To mitigate this mismatch, we propose a new architecture that replaces the attention function with a Gaussian kernel, which is a shift-invariant kernel. First, we mathematically demonstrate that self-attention with shared weight parameters for queries and keys is equivalent to a normalized kernel function. By replacing this kernel function with the proposed Gaussian kernel, the architecture becomes completely shift-invariant, with relative position information embedded using a frame indexing technique. The proposed Gaussian kernelized SA is applied to connectionist temporal classification (CTC) based ASR. An experimental evaluation on the Corpus of Spontaneous Japanese (CSJ) and TED-LIUM 3 benchmarks shows that the proposed SA achieves a significant improvement in accuracy on long sequence data (e.g., from 24.0% to 6.0% WER on CSJ) without any windowing techniques.
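The abstract above only sketches the mechanism, so the following is a minimal NumPy illustration of what Gaussian kernelized self-attention with a frame-indexing feature could look like. It is a sketch under assumptions, not the authors' implementation: the function name gaussian_kernel_self_attention, the projection matrices W and Wv, and the parameters sigma and index_scale are illustrative, and the normalization over keys stands in for the softmax of standard self-attention.

import numpy as np

def gaussian_kernel_self_attention(x, W, Wv, sigma=1.0, index_scale=0.01):
    # x: (T, d) input frames. W: (d + 1, d_k) shared query/key projection,
    # applied after a frame index is appended. Wv: (d, d_v) value projection.
    T = x.shape[0]
    # Frame indexing: append a scaled frame index so the shift-invariant
    # Gaussian kernel can still see relative position information.
    idx = index_scale * np.arange(T, dtype=x.dtype)[:, None]
    z = np.concatenate([x, idx], axis=1) @ W
    # Gaussian kernel: k(i, j) = exp(-||z_i - z_j||^2 / (2 * sigma^2)),
    # replacing the exponentiated dot product of standard self-attention.
    sq_dist = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    scores = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Normalizing over the key axis plays the role of the softmax.
    attn = scores / scores.sum(axis=1, keepdims=True)
    return attn @ (x @ Wv)

# Illustrative usage on random features (100 frames of 80-dim filterbanks).
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 80)).astype(np.float32)
W = 0.1 * rng.standard_normal((81, 64)).astype(np.float32)
Wv = 0.1 * rng.standard_normal((80, 64)).astype(np.float32)
y = gaussian_kernel_self_attention(x, W, Wv)  # (100, 64) context vectors

Because the kernel depends only on differences between projected frames, shifting the appended index by a constant moves every z_i by the same vector and leaves the attention weights unchanged, which is the shift-invariance property the abstract refers to.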
Pages: 6214-6218
Page count: 5