CYCLE-CONSISTENCY TRAINING FOR END-TO-END SPEECH RECOGNITION

Cited by: 0
Authors
Hori, Takaaki [1 ]
Astudillo, Ramon [2 ]
Hayashi, Tomoki [3 ]
Zhang, Yu [4 ]
Watanabe, Shinji [5 ]
Le Roux, Jonathan [1 ]
Affiliations
[1] MERL, Cambridge, MA 02139 USA
[2] INESC ID, Spoken Language Syst Lab, Lisbon, Portugal
[3] Nagoya Univ, Nagoya, Aichi, Japan
[4] Google Inc, Mountain View, CA USA
[5] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019
Keywords
speech recognition; end-to-end; unpaired data; cycle consistency; neural networks
DOI
10.1109/icassp.2019.8683307
Chinese Library Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
This paper presents a method to train end-to-end automatic speech recognition (ASR) models using unpaired data. Although the end-to-end approach can eliminate the need for expert knowledge such as pronunciation dictionaries to build ASR systems, it still requires a large amount of paired data, i.e., speech utterances and their transcriptions. Cycle-consistency losses have recently been proposed as a way to mitigate the problem of limited paired data. These approaches compose a given transformation with its reverse operation, e.g., text-to-speech (TTS) with ASR, to build a loss that only requires unsupervised data, speech in this example. Applying cycle consistency to ASR models is not trivial, since fundamental information, such as speaker traits, is lost in the intermediate text bottleneck. To solve this problem, this work presents a loss that is based on the speech encoder state sequence instead of the raw speech signal. This is achieved by training a Text-To-Encoder (TTE) model and defining a loss based on the encoder reconstruction error. Experimental results on the LibriSpeech corpus show that the proposed cycle-consistency training reduced the word error rate by 14.7% relative to an initial model trained with 100 hours of paired data, using an additional 360 hours of audio data without transcriptions. We also investigate the use of text-only data, mainly for language modeling, to further improve performance in the unpaired data training scenario.
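The core idea in the abstract, replacing a raw-speech cycle loss with a reconstruction loss on the encoder state sequence via a Text-To-Encoder (TTE) model, can be sketched with toy stand-in components. All names, shapes, and the greedy decoding stub below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

# Toy stand-ins for the three components in the ASR -> TTE cycle:
#   asr_encoder: speech frames -> encoder state sequence h
#   asr_decode:  h -> token ids (greedy stub in place of an attention decoder)
#   tte_model:   token ids -> reconstructed state sequence h_hat

def asr_encoder(speech, W_enc):
    # project speech features into the encoder-state space
    return np.tanh(speech @ W_enc)                 # (T, d_enc)

def asr_decode(h):
    # greedy stub: treat the argmax dimension of each state as a "token"
    return h.argmax(axis=1)                        # (T,)

def tte_model(tokens, E_tok):
    # map hypothesized tokens back into encoder-state space
    return E_tok[tokens]                           # (T, d_enc)

def cycle_consistency_loss(speech, W_enc, E_tok):
    h = asr_encoder(speech, W_enc)
    tokens = asr_decode(h)                         # text bottleneck: no reference
                                                   # transcription is needed here
    h_hat = tte_model(tokens, E_tok)
    # encoder reconstruction error, the quantity the cycle loss penalizes
    return float(np.mean((h - h_hat) ** 2))

rng = np.random.default_rng(0)
speech = rng.standard_normal((50, 8))              # 50 frames, 8-dim features
W_enc = rng.standard_normal((8, 4)) * 0.1
E_tok = rng.standard_normal((4, 4)) * 0.1
loss = cycle_consistency_loss(speech, W_enc, E_tok)
print(loss)
```

Because the loss compares encoder states rather than reconstructed waveforms, information that the text bottleneck discards (e.g., speaker traits) does not have to be regenerated for the cycle to close; only unpaired speech is consumed.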
Pages: 6271-6275 (5 pages)
Related Papers
50 records
  • [1] END-TO-END TRAINING OF A LARGE VOCABULARY END-TO-END SPEECH RECOGNITION SYSTEM
    Kim, Chanwoo
    Kim, Sungsoo
    Kim, Kwangyoun
    Kumar, Mehul
    Kim, Jiyeon
    Lee, Kyungmin
    Han, Changwoo
    Garg, Abhinav
    Kim, Eunhyang
    Shin, Minkyoo
    Singh, Shatrughan
    Heck, Larry
    Gowda, Dhananjaya
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 562 - 569
  • [2] SELF-TRAINING FOR END-TO-END SPEECH RECOGNITION
    Kahn, Jacob
    Lee, Ann
    Hannun, Awni
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7084 - 7088
  • [3] SEQUENCE-LEVEL CONSISTENCY TRAINING FOR SEMI-SUPERVISED END-TO-END AUTOMATIC SPEECH RECOGNITION
    Masumura, Ryo
    Ihori, Mana
    Takashima, Akihiko
    Moriya, Takafumi
    Ando, Atsushi
    Shinohara, Yusuke
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7054 - 7058
  • [4] End-to-End Speech Recognition Sequence Training With Reinforcement Learning
    Tjandra, Andros
    Sakti, Sakriani
    Nakamura, Satoshi
    IEEE ACCESS, 2019, 7 : 79758 - 79769
  • [5] Improved training for online end-to-end speech recognition systems
    Kim, Suyoun
    Seltzer, Michael L.
    Li, Jinyu
    Zhao, Rui
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2913 - 2917
  • [6] Multitask Training with Text Data for End-to-End Speech Recognition
    Wang, Peidong
    Sainath, Tara N.
    Weiss, Ron J.
    INTERSPEECH 2021, 2021, : 2566 - 2570
  • [7] SEQUENCE NOISE INJECTED TRAINING FOR END-TO-END SPEECH RECOGNITION
    Saon, George
    Tuske, Zoltan
    Audhkhasi, Kartik
    Kingsbury, Brian
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6261 - 6265
  • [8] Improved training of end-to-end attention models for speech recognition
    Zeyer, Albert
    Irie, Kazuki
    Schlueter, Ralf
    Ney, Hermann
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 7 - 11
  • [9] Serialized Output Training for End-to-End Overlapped Speech Recognition
    Kanda, Naoyuki
    Gaur, Yashesh
    Wang, Xiaofei
    Meng, Zhong
    Yoshioka, Takuya
    INTERSPEECH 2020, 2020, : 2797 - 2801
  • [10] Fully end-to-end EEG to speech translation using multi-scale optimized dual generative adversarial network with cycle-consistency loss
    Ma, Chen
    Zhang, Yue
    Guo, Yina
    Liu, Xin
    Hong, Shangguan
    Wang, Juan
    Zhao, Luqing
    NEUROCOMPUTING, 2025, 616