Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition

Cited by: 19
Authors
Liu, Bin [1 ,2 ]
Nie, Shuai [1 ]
Liang, Shan [1 ]
Liu, Wenju [1 ]
Yu, Meng [3 ]
Chen, Lianwu [4 ]
Peng, Shouye [5 ]
Li, Changliang [6 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Tencent AI Lab, Bellevue, WA USA
[4] Tencent AI Lab, Shenzhen, Peoples R China
[5] Xueersi Online Sch, Beijing, Peoples R China
[6] Kingsoft AI Lab, Beijing, Peoples R China
Source
INTERSPEECH 2019
Funding
National Natural Science Foundation of China
Keywords
end-to-end speech recognition; robust speech recognition; speech enhancement; generative adversarial networks;
DOI
10.21437/Interspeech.2019-1242
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104; 100213
Abstract
Recently, end-to-end systems have made significant breakthroughs in the field of speech recognition. However, such a single end-to-end architecture is not especially robust to input variations caused by noise and reverberation, so its performance degrades dramatically in real-world conditions. To alleviate this issue, the mainstream approach is to use a well-designed speech enhancement module as the front-end of ASR. However, enhancement modules can introduce speech distortions and mismatches with the training conditions, which sometimes degrades ASR performance. In this paper, we propose jointly adversarial enhancement training to boost the robustness of end-to-end systems. Specifically, during training we use a compositional scheme consisting of a mask-based enhancement network, an attention-based encoder-decoder network and a discriminator network. The discriminator is used to distinguish the enhanced features produced by the enhancement network from clean features, which guides the enhancement network toward outputs that follow the realistic clean-speech distribution. With the joint optimization of the recognition, enhancement and adversarial losses, the compositional scheme is expected to automatically learn representations that are more robust for the recognition task. Systematic experiments on AISHELL-1 show that the proposed method improves the noise robustness of end-to-end systems and achieves a relative error rate reduction of 4.6% over multi-condition training.
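The abstract above describes joint optimization of recognition, enhancement and adversarial losses over a mask-based enhancement network, an attention-based encoder-decoder and a discriminator. Below is a minimal, hypothetical PyTorch-style sketch of how such a joint objective could be wired up; every module, dimension and loss weight (MaskEnhancer, ToyRecognizer, lam_enh, lam_adv, ...) is an illustrative assumption rather than the authors' actual implementation.

# Hypothetical sketch of the joint training objective described in the abstract.
# Not the paper's code: module definitions, dimensions and loss weights are
# illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskEnhancer(nn.Module):
    """Predicts a [0, 1] time-frequency mask and applies it to noisy features."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, feat_dim), nn.Sigmoid())

    def forward(self, noisy):                       # noisy: (B, T, feat_dim)
        h, _ = self.rnn(noisy)
        return self.mask(h) * noisy                 # enhanced features


class ToyRecognizer(nn.Module):
    """Stand-in for the attention-based encoder-decoder: returns a CE loss."""
    def __init__(self, feat_dim=80, vocab=32):
        super().__init__()
        self.enc = nn.LSTM(feat_dim, 128, batch_first=True)
        self.out = nn.Linear(128, vocab)

    def forward(self, feats, targets):              # targets: (B, T) token ids
        h, _ = self.enc(feats)
        logits = self.out(h)                        # (B, T, vocab)
        return F.cross_entropy(logits.transpose(1, 2), targets)


class Discriminator(nn.Module):
    """Scores whether a feature sequence looks clean (real) or enhanced (fake)."""
    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, feats):                       # (B, T, feat_dim) -> (B,) logits
        return self.net(feats).mean(dim=(1, 2))


def joint_losses(enhancer, recognizer, discriminator, noisy, clean, targets,
                 lam_enh=1.0, lam_adv=0.1):
    """Recognition + enhancement + adversarial losses for one training step."""
    enhanced = enhancer(noisy)
    asr_loss = recognizer(enhanced, targets)                  # recognition loss
    enh_loss = F.mse_loss(enhanced, clean)                    # feature-level MSE
    ones = torch.ones(noisy.size(0))
    adv_loss = F.binary_cross_entropy_with_logits(            # fool the critic
        discriminator(enhanced), ones)
    gen_loss = asr_loss + lam_enh * enh_loss + lam_adv * adv_loss

    # Discriminator update: clean -> real, enhanced (detached) -> fake.
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(clean), ones) +
              F.binary_cross_entropy_with_logits(discriminator(enhanced.detach()),
                                                 torch.zeros(noisy.size(0))))
    return gen_loss, d_loss


if __name__ == "__main__":
    B, T, D, V = 4, 50, 80, 32
    enh, rec, disc = MaskEnhancer(D), ToyRecognizer(D, V), Discriminator(D)
    noisy, clean = torch.randn(B, T, D), torch.randn(B, T, D)
    targets = torch.randint(0, V, (B, T))
    g_loss, d_loss = joint_losses(enh, rec, disc, noisy, clean, targets)
    print(g_loss.item(), d_loss.item())

In standard GAN fashion, gen_loss would update the enhancement and recognition networks while d_loss updates the discriminator, alternating between the two.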
Pages: 491-495
Number of pages: 5
Related Papers
50 records in total
  • [1] Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition
    Li, Lujun
    Kang, Yikai
    Shi, Yuchen
    Kürzinger, Ludwig
    Watzel, Tobias
    Rigoll, Gerhard
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2021, 2021 (01)
  • [2] Adversarial Regularization for Attention Based End-to-End Robust Speech Recognition
    Sun, Sining
    Guo, Pengcheng
    Xie, Lei
    Hwang, Mei-Yuh
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (11) : 1826 - 1838
  • [3] STREAMING END-TO-END SPEECH RECOGNITION WITH JOINTLY TRAINED NEURAL FEATURE ENHANCEMENT
    Kim, Chanwoo
    Garg, Abhinav
    Gowda, Dhananjaya
    Mun, Seongkyu
    Han, Changwoo
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6773 - 6777
  • [4] End-to-End Speech Translation with Adversarial Training
    Li, Xuancai
    Chen, Kehai
    Zhao, Tiejun
    Yang, Muyun
    WORKSHOP ON AUTOMATIC SIMULTANEOUS TRANSLATION CHALLENGES, RECENT ADVANCES, AND FUTURE DIRECTIONS, 2020, : 10 - 14
  • [5] COMBINING END-TO-END AND ADVERSARIAL TRAINING FOR LOW-RESOURCE SPEECH RECOGNITION
    Drexler, Jennifer
    Glass, James
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 361 - 368
  • [6] ADVERSARIAL TRAINING OF END-TO-END SPEECH RECOGNITION USING A CRITICIZING LANGUAGE MODEL
    Liu, Alexander H.
    Lee, Hung-yi
    Lee, Lin-shan
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6176 - 6180
  • [7] Accented Speech Recognition Based on End-to-End Domain Adversarial Training of Neural Networks
    Na, Hyeong-Ju
    Park, Jeong-Sik
    APPLIED SCIENCES-BASEL, 2021, 11 (18):
  • [8] SPEECH ENHANCEMENT USING END-TO-END SPEECH RECOGNITION OBJECTIVES
    Subramanian, Aswin Shanmugam
    Wang, Xiaofei
    Baskar, Murali Karthick
    Watanabe, Shinji
    Taniguchi, Toru
    Tran, Dung
    Fujita, Yuya
    2019 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA), 2019, : 234 - 238
  • [9] END-TO-END TRAINING OF A LARGE VOCABULARY END-TO-END SPEECH RECOGNITION SYSTEM
    Kim, Chanwoo
    Kim, Sungsoo
    Kim, Kwangyoun
    Kumar, Mehul
    Kim, Jiyeon
    Lee, Kyungmin
    Han, Changwoo
    Garg, Abhinav
    Kim, Eunhyang
    Shin, Minkyoo
    Singh, Shatrughan
    Heck, Larry
    Gowda, Dhananjaya
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 562 - 569