Neural Network Based Multi-Factor Aware Joint Training for Robust Speech Recognition

Times cited: 24
Authors
Qian, Yanmin [1]
Tan, Tian [1]
Yu, Dong [2]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] Microsoft Res, Redmond, WA 98052 USA
Funding
National Science Foundation (USA)
Keywords
Deep neural network; factor-aware training; factor representation; multi-task learning; robust speech recognition; front-end; speaker; adaptation
DOI
10.1109/TASLP.2016.2598308
Chinese Library Classification number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Although great progress has been made in automatic speech recognition (ASR), performance still degrades significantly in noisy environments. In this paper, a novel factor-aware training framework, named neural network-based multi-factor aware joint training, is proposed to improve recognition accuracy in noise-robust speech recognition. The approach is a structured model that integrates several functional modules into a single deep computational model. We explore and extract speaker, phone, and environment factor representations using deep neural networks (DNNs), and integrate them into the main ASR DNN to improve classification accuracy. In addition, the hidden activations of the main ASR DNN are used to improve factor extraction, which in turn helps the ASR DNN. All model parameters, including those of the ASR DNN and the factor-extraction DNNs, are jointly optimized under a multitask learning framework. Unlike traditional factor-aware training techniques, our approach requires no explicit separate stages for factor extraction and adaptation. Moreover, the proposed joint training can easily be combined with conventional factor-aware training that uses explicit factors, such as i-vectors, noise energy, and T60 values, to obtain additional improvement. The proposed method is evaluated on two noise-robust tasks: the AMI single distant microphone (SDM) task, in which reverberation is the main concern, and the Aurora4 task, in which multiple noise types exist. Experiments on both tasks show that the proposed model significantly reduces word error rate (WER); the best configuration achieved a relative WER reduction of more than 15% over the baselines on both tasks.
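The abstract describes the architecture only at a high level. Below is a minimal, hypothetical sketch in plain Python (standard library only) of the general idea, assuming a hybrid setup in which the main network predicts senone (tied-state) posteriors, one small factor-extraction branch per factor (here a speaker branch and an environment branch; the phone factor is left out for brevity) whose bottleneck output is concatenated with the acoustic features before the main classifier, and a multitask loss that adds weighted factor-classification cross-entropy terms to the main ASR cross-entropy. All class and function names, layer sizes, labels, and task weights are illustrative assumptions, not the authors' configuration. The sketch also omits the feedback path from the ASR hidden activations into the factor extractors and computes only the forward pass and the loss.

```python
import math
import random


def linear(x, w, b):
    # Affine layer for one example: y[j] = sum_i w[j][i] * x[i] + b[j].
    return [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(w, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def log_softmax(v):
    # Numerically stable log-softmax (subtract the max before exponentiating).
    m = max(v)
    log_z = math.log(sum(math.exp(a - m) for a in v))
    return [a - m - log_z for a in v]


def init_layer(n_in, n_out, rng):
    # Small random weights, zero biases (illustrative initialization only).
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


class MultiFactorAwareNet:
    """Hypothetical toy model: one factor branch per factor plus a main ASR branch."""

    def __init__(self, feat_dim, factor_specs, hidden_dim, n_senones, seed=0):
        rng = random.Random(seed)
        # factor_specs maps a factor name to (representation size, number of factor classes).
        self.factors = {}
        for name, (rep_dim, n_classes) in factor_specs.items():
            self.factors[name] = {
                "rep": init_layer(feat_dim, rep_dim, rng),   # factor representation (bottleneck)
                "cls": init_layer(rep_dim, n_classes, rng),  # auxiliary factor classifier
            }
        total_rep = sum(rep_dim for rep_dim, _ in factor_specs.values())
        # The main ASR branch sees the features concatenated with all factor representations.
        self.hidden = init_layer(feat_dim + total_rep, hidden_dim, rng)
        self.out = init_layer(hidden_dim, n_senones, rng)

    def forward(self, x):
        reps, factor_logp = [], {}
        for name, layers in self.factors.items():
            r = relu(linear(x, *layers["rep"]))
            reps.extend(r)
            factor_logp[name] = log_softmax(linear(r, *layers["cls"]))
        h = relu(linear(x + reps, *self.hidden))
        senone_logp = log_softmax(linear(h, *self.out))
        return senone_logp, factor_logp


def multitask_loss(senone_logp, factor_logp, senone_label, factor_labels, weights):
    # Main ASR cross-entropy plus a weighted cross-entropy term for each factor task.
    loss = -senone_logp[senone_label]
    for name, logp in factor_logp.items():
        loss += weights[name] * -logp[factor_labels[name]]
    return loss


# Usage with made-up sizes, labels, and task weights.
net = MultiFactorAwareNet(feat_dim=4,
                          factor_specs={"speaker": (2, 3), "environment": (2, 2)},
                          hidden_dim=5, n_senones=6)
x = [0.5, -1.0, 0.25, 2.0]
senone_logp, factor_logp = net.forward(x)
loss = multitask_loss(senone_logp, factor_logp, senone_label=1,
                      factor_labels={"speaker": 0, "environment": 1},
                      weights={"speaker": 0.1, "environment": 0.1})
print(f"multitask loss: {loss:.4f}")
```

In an actual joint-training setup, the gradient of this single scalar loss would be taken with respect to every parameter (factor branches and main network alike), typically by an automatic-differentiation framework, so that factor extraction and recognition are optimized together rather than in separate stages.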
Pages: 2231-2240
Number of pages: 10