Direct Acoustics-to-Word Models for English Conversational Speech Recognition

Cited by: 76
Authors
Audhkhasi, Kartik [1 ]
Ramabhadran, Bhuvana [1 ]
Saon, George [1 ]
Picheny, Michael [1 ]
Nahamoo, David [1 ]
Affiliations
[1] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
Source
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017
Keywords
automatic speech recognition; neural networks; end-to-end
DOI
10.21437/Interspeech.2017-546
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and a separately trained language model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome. These models do not require an LM or even a decoder at run-time and hence recognize speech with minimal complexity. However, due to the large number of word output units, CTC word models require orders of magnitude more data to train reliably compared to traditional systems. We present some techniques to mitigate this issue. Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder, compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also present rescoring results on CTC word model lattices to quantify the performance benefits of an LM, and contrast the performance of word and phone CTC models.
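The run-time simplicity described in the abstract (no LM, no decoder) amounts to greedy, best-path CTC decoding over word output units. Below is a minimal Python sketch of that step, assuming per-frame scores over a word vocabulary plus a CTC blank symbol; the tensor shapes, function name, and toy vocabulary are illustrative assumptions, not the authors' implementation.

import torch

def greedy_ctc_word_decode(logits, id_to_word, blank_id=0):
    # Greedy (best-path) CTC decoding: pick the most likely unit per frame,
    # collapse consecutive repeats, and drop blanks. No LM, no beam search.
    # logits: tensor of shape (num_frames, vocab_size) for one utterance.
    # id_to_word: mapping from output indices to word strings (blank excluded).
    best_ids = logits.argmax(dim=-1).tolist()
    words, prev = [], blank_id
    for idx in best_ids:
        if idx != blank_id and idx != prev:
            words.append(id_to_word[idx])
        prev = idx
    return words

# Toy usage: 6 frames, vocabulary {0: blank, 1: "hello", 2: "world"}.
logits = torch.randn(6, 3)
print(greedy_ctc_word_decode(logits, {1: "hello", 2: "world"}))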
Pages: 959-963
Page count: 5