Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers

Cited by: 17
Authors
Hori, Takaaki [1]
Moritz, Niko [1]
Hori, Chiori [1]
Le Roux, Jonathan [1]
Affiliations
[1] Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139, USA
Source
INTERSPEECH 2021, 2021
Keywords
end-to-end speech recognition; transformer; conformer; long context ASR; attention
DOI
10.21437/Interspeech.2021-1643
CLC Number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline Code
100104; 100213
Abstract
This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lectures and conversational speech. Most end-to-end ASR models are designed to recognize independent utterances, but contextual information (e.g., speaker or topic) over multiple utterances is known to be useful for ASR. In our prior work, we proposed a context-expanded Transformer that accepts multiple consecutive utterances at the same time and predicts an output sequence for the last utterance, achieving a 5-15% relative error reduction over utterance-based baselines on lecture and conversational ASR benchmarks. Although these results show a remarkable performance gain, there is still potential to further improve the model architecture and the decoding process. In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve the accuracy, (2) accelerating the decoding process with a novel activation recycling technique, and (3) enabling streaming decoding with triggered attention. We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance, obtaining a 17.3% character error rate for the HKUST dataset and 12.0%/6.3% word error rates for the Switchboard-300 Eval2000 CallHome/Switchboard test sets. The new decoding method reduces decoding time by more than 50% and further enables streaming ASR with limited accuracy degradation.
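The core context-expansion mechanism summarized above can be sketched in a few lines of PyTorch. The code below is a minimal illustration under stated assumptions, not the authors' implementation: every name (ContextExpandedASR, context_feats, and so on) is hypothetical, a plain Transformer encoder-decoder stands in for the paper's Conformer, and the activation-recycling and triggered-attention extensions are omitted.

# Minimal sketch of the context-expanded Transformer idea from the abstract.
# Not the authors' code; all names are hypothetical, and a vanilla Transformer
# stands in for the Conformer encoder used in the paper.
import torch
import torch.nn as nn

class ContextExpandedASR(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, nhead=4, layers=6, vocab=5000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
        self.embed = nn.Embedding(vocab, d_model)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, context_feats, current_feats, prev_tokens):
        # Key step: features of the preceding utterances are concatenated with
        # the current utterance along the time axis, so self-attention in the
        # encoder sees the expanded multi-utterance context.
        x = torch.cat([context_feats, current_feats], dim=1)
        memory = self.encoder(self.proj(x))
        # Causal mask so each position attends only to earlier output tokens.
        t = prev_tokens.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        # The decoder attends to the whole expanded memory but is trained to
        # predict the transcript of the *last* utterance only.
        h = self.decoder(self.embed(prev_tokens), memory, tgt_mask=causal)
        return self.out(h)

# Hypothetical usage: two 300-frame context utterances plus a 200-frame
# current utterance, 80-dim filterbank features, batch size 1.
model = ContextExpandedASR()
logits = model(torch.randn(1, 600, 80),            # concatenated context utterances
               torch.randn(1, 200, 80),            # current utterance
               torch.randint(0, 5000, (1, 20)))    # previous output tokens
print(logits.shape)  # torch.Size([1, 20, 5000])

In training, the loss would be computed only on the last utterance's tokens, which is what lets the model exploit the surrounding context without having to transcribe it; the paper's activation-recycling extension would additionally cache encoder activations shared between successive sliding windows to speed up decoding.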
Pages: 2097-2101
Page count: 5