Large-Scale Semi-Supervised Training in Deep Learning Acoustic Model for ASR

Cited by: 10
Authors
Long, Yanhua [1 ]
Li, Yijie [2 ]
Wei, Shuang [1 ]
Zhang, Qiaozheng [1 ]
Yang, Chunxia [1 ]
Affiliations
[1] Shanghai Normal Univ, SHNU Unisound Joint Lab Nat Human Comp Interact, Shanghai 200234, Peoples R China
[2] Beijing Unisound Informat Technol Co Ltd, Beijing 100028, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semi-supervised learning; data preprocessing; acoustic modeling; speech recognition; SPEECH RECOGNITION;
DOI
10.1109/ACCESS.2019.2940961
CLC number
TP [automation and computer technology];
Discipline classification code
0812;
Abstract
This study investigated large-scale semi-supervised training (SST) to improve acoustic models for automatic speech recognition. Conventional self-training, the recently proposed committee-based SST using heterogeneous neural networks, and lattice-based SST were examined and compared. Large-scale SST was studied in deep neural network acoustic modeling with respect to the automatic transcription quality, the importance of data filtering, the training data quantity, and other data attributes of a large quantity of multi-genre unsupervised live data. We found that SST behavior on large-scale ASR tasks is very different from that observed on small-scale SST: 1) big data can tolerate a certain degree of mislabeling in the automatic transcriptions, so further performance gains are possible with more unsupervised fresh data even when the automatic transcriptions contain some errors; 2) the audio attributes, transcription quality, and importance of the fresh data matter more than the increased data quantity for large-scale SST; and 3) performance gains differ widely across recognition tasks, so the benefits highly depend on the selected attributes of the unsupervised data and the data scale of the baseline ASR system. Furthermore, we proposed a novel utterance filtering approach based on active learning to improve data selection in large-scale SST. Experimental results showed that SST with the proposed data filtering yields a 2-11% relative word error rate reduction on five multi-genre recognition tasks, even with a baseline acoustic model already well trained on a 10,000-hour supervised dataset.
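The data-selection step the abstract describes can be illustrated with a minimal sketch. This is a generic confidence-threshold utterance filter, not the paper's active-learning criterion; the field names and the 0.9 threshold are hypothetical, and real systems would derive confidences from decoder lattices.

```python
# Illustrative sketch: confidence-based utterance filtering for
# semi-supervised training (SST). Utterances are kept only when the
# average per-word decoder confidence of their automatic transcription
# clears a threshold, so mislabeled audio is less likely to enter training.

def filter_utterances(utterances, min_confidence=0.9):
    """Return utterances whose mean word confidence >= min_confidence."""
    selected = []
    for utt in utterances:
        confidences = utt["word_confidences"]
        avg_conf = sum(confidences) / len(confidences)
        if avg_conf >= min_confidence:
            selected.append(utt)
    return selected

# Hypothetical pool: utt1 has high-confidence words, utt2 does not.
pool = [
    {"id": "utt1", "word_confidences": [0.95, 0.92, 0.97]},
    {"id": "utt2", "word_confidences": [0.60, 0.85, 0.70]},
]
kept = filter_utterances(pool)  # only utt1 passes the 0.9 threshold
```

In practice the threshold trades data quantity against transcription quality, which is exactly the tension the study's findings 1) and 2) describe.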
Pages: 133615-133627
Page count: 13