CYCLE-CONSISTENCY TRAINING FOR END-TO-END SPEECH RECOGNITION

Cited by: 0
Authors
Hori, Takaaki [1 ]
Astudillo, Ramon [2 ]
Hayashi, Tomoki [3 ]
Zhang, Yu [4 ]
Watanabe, Shinji [5 ]
Le Roux, Jonathan [1 ]
Affiliations
[1] MERL, Cambridge, MA 02139 USA
[2] INESC ID, Spoken Language Syst Lab, Lisbon, Portugal
[3] Nagoya Univ, Nagoya, Aichi, Japan
[4] Google Inc, Mountain View, CA USA
[5] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019
Keywords
speech recognition; end-to-end; unpaired data; cycle consistency; neural networks
DOI
10.1109/icassp.2019.8683307
Chinese Library Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
This paper presents a method to train end-to-end automatic speech recognition (ASR) models using unpaired data. Although the end-to-end approach can eliminate the need for expert knowledge such as pronunciation dictionaries to build ASR systems, it still requires a large amount of paired data, i.e., speech utterances and their transcriptions. Cycle-consistency losses have recently been proposed as a way to mitigate the problem of limited paired data. These approaches compose a given transformation with its reverse operation, e.g., text-to-speech (TTS) with ASR, to build a loss that only requires unsupervised data, speech in this example. Applying cycle consistency to ASR models is not trivial, since fundamental information, such as speaker traits, is lost in the intermediate text bottleneck. To solve this problem, this work presents a loss that is based on the speech encoder state sequence instead of the raw speech signal. This is achieved by training a Text-To-Encoder (TTE) model and defining a loss based on the encoder reconstruction error. Experimental results on the LibriSpeech corpus show that the proposed cycle-consistency training reduced the word error rate by 14.7% relative to an initial model trained with 100 hours of paired data, using an additional 360 hours of audio data without transcriptions. We also investigate the use of text-only data, mainly for language modeling, to further improve performance in the unpaired data training scenario.
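The core idea in the abstract, replacing a raw-speech cycle loss with a reconstruction loss on the encoder state sequence via a Text-To-Encoder (TTE) model, can be sketched with toy stand-in components. All names, shapes, and the greedy decoding stub below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

# Toy stand-ins for the three components in the ASR -> TTE cycle:
#   asr_encoder: speech frames -> encoder state sequence h
#   asr_decode:  h -> token ids (greedy stub in place of an attention decoder)
#   tte_model:   token ids -> reconstructed state sequence h_hat

def asr_encoder(speech, W_enc):
    # project speech features into the encoder-state space
    return np.tanh(speech @ W_enc)                 # (T, d_enc)

def asr_decode(h):
    # greedy stub: treat the argmax dimension of each state as a "token"
    return h.argmax(axis=1)                        # (T,)

def tte_model(tokens, E_tok):
    # map hypothesized tokens back into encoder-state space
    return E_tok[tokens]                           # (T, d_enc)

def cycle_consistency_loss(speech, W_enc, E_tok):
    h = asr_encoder(speech, W_enc)
    tokens = asr_decode(h)                         # text bottleneck: no reference
                                                   # transcription is needed here
    h_hat = tte_model(tokens, E_tok)
    # encoder reconstruction error, the quantity the cycle loss penalizes
    return float(np.mean((h - h_hat) ** 2))

rng = np.random.default_rng(0)
speech = rng.standard_normal((50, 8))              # 50 frames, 8-dim features
W_enc = rng.standard_normal((8, 4)) * 0.1
E_tok = rng.standard_normal((4, 4)) * 0.1
loss = cycle_consistency_loss(speech, W_enc, E_tok)
print(loss)
```

Because the loss compares encoder states rather than reconstructed waveforms, information that the text bottleneck discards (e.g., speaker traits) does not have to be regenerated for the cycle to close; only unpaired speech is consumed.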
Pages: 6271-6275 (5 pages)
Related Papers
50 records
  • [1] END-TO-END TRAINING OF A LARGE VOCABULARY END-TO-END SPEECH RECOGNITION SYSTEM
    Kim, Chanwoo
    Kim, Sungsoo
    Kim, Kwangyoun
    Kumar, Mehul
    Kim, Jiyeon
    Lee, Kyungmin
    Han, Changwoo
    Garg, Abhinav
    Kim, Eunhyang
    Shin, Minkyoo
    Singh, Shatrughan
    Heck, Larry
    Gowda, Dhananjaya
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 562 - 569
  • [2] SELF-TRAINING FOR END-TO-END SPEECH RECOGNITION
    Kahn, Jacob
    Lee, Ann
    Hannun, Awni
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7084 - 7088
  • [3] SEQUENCE-LEVEL CONSISTENCY TRAINING FOR SEMI-SUPERVISED END-TO-END AUTOMATIC SPEECH RECOGNITION
    Masumura, Ryo
    Ihori, Mana
    Takashima, Akihiko
    Moriya, Takafumi
    Ando, Atsushi
    Shinohara, Yusuke
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7054 - 7058
  • [4] End-to-End Speech Recognition Sequence Training With Reinforcement Learning
    Tjandra, Andros
    Sakti, Sakriani
    Nakamura, Satoshi
    IEEE ACCESS, 2019, 7 : 79758 - 79769
  • [5] Improved training for online end-to-end speech recognition systems
    Kim, Suyoun
    Seltzer, Michael L.
    Li, Jinyu
    Zhao, Rui
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2913 - 2917
  • [6] Multitask Training with Text Data for End-to-End Speech Recognition
    Wang, Peidong
    Sainath, Tara N.
    Weiss, Ron J.
    INTERSPEECH 2021, 2021, : 2566 - 2570
  • [7] SEQUENCE NOISE INJECTED TRAINING FOR END-TO-END SPEECH RECOGNITION
    Saon, George
    Tuske, Zoltan
    Audhkhasi, Kartik
    Kingsbury, Brian
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6261 - 6265
  • [8] Improved training of end-to-end attention models for speech recognition
    Zeyer, Albert
    Irie, Kazuki
    Schlueter, Ralf
    Ney, Hermann
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 7 - 11
  • [9] Serialized Output Training for End-to-End Overlapped Speech Recognition
    Kanda, Naoyuki
    Gaur, Yashesh
    Wang, Xiaofei
    Meng, Zhong
    Yoshioka, Takuya
    INTERSPEECH 2020, 2020, : 2797 - 2801
  • [10] Fully end-to-end EEG to speech translation using multi-scale optimized dual generative adversarial network with cycle-consistency loss
    Ma, Chen
    Zhang, Yue
    Guo, Yina
    Liu, Xin
    Hong, Shangguan
    Wang, Juan
    Zhao, Luqing
    NEUROCOMPUTING, 2025, 616