Label Driven Time-Frequency Masking for Robust Continuous Speech Recognition

Cited by: 4
Authors
Soni, Meet [1 ]
Panda, Ashish [1 ]
Affiliations
[1] TCS Innovat Lab, Yantra Pk, Mumbai, Maharashtra, India
Source
INTERSPEECH 2019 | 2019
Keywords
speech recognition; Time-Frequency masking; robust speech recognition; multi-conditioned training; front-end; noise
DOI
10.21437/Interspeech.2019-2172
Abstract
Time-Frequency (T-F) masking based approaches have been shown to provide significant gains in Automatic Speech Recognition performance in the presence of additive noise. Such approaches improve performance when the T-F masking front-end is trained jointly with the acoustic model. However, these systems still rely on a pre-trained T-F masking enhancement block, trained on pairs of clean and noisy speech signals; pre-training is considered necessary because of the large number of parameters in the enhancement network. In this paper, we propose flat-start joint training of a network that contains both a T-F masking based enhancement block and a phoneme classification block. In particular, we use a fully convolutional network as the enhancement front-end to reduce the number of parameters. We train the network by jointly updating the parameters of both blocks using tied Context-Dependent phoneme states as targets. We observe that pre-training of the proposed enhancement block is not necessary for convergence; in fact, the proposed flat-start joint training converges faster than the baseline multi-condition trained model. Experiments on the Aurora-4 database show a 7.06% relative improvement over the multi-condition trained baseline, with similar improvements for unseen test conditions.
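The flat-start joint training idea described in the abstract can be sketched in plain NumPy. This is an illustrative toy, not the paper's implementation: a single linear layer stands in for the fully convolutional mask estimator, another for the acoustic model, and the only training signal is the cross-entropy loss on (here randomly generated) senone-style labels, back-propagated through the element-wise mask. All dimensions and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: F frequency bins per frame, C senone classes, N frames.
F, C, N = 8, 4, 32

# Stand-ins for noisy log-spectrogram frames and tied CD-phoneme labels.
X = rng.standard_normal((N, F))
y = rng.integers(0, C, size=N)

# Flat-start: both blocks start from random weights, no enhancement pre-training.
W_m = rng.standard_normal((F, F)) * 0.1   # mask-estimation block (FCN stand-in)
W_c = rng.standard_normal((F, C)) * 0.1   # acoustic-model / classifier block

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    M = sigmoid(X @ W_m)      # predicted T-F mask, values in [0, 1]
    E = M * X                 # enhanced features: element-wise masking
    return M, E, E @ W_c      # logits over senone classes

def loss_and_grads(X, y):
    M, E, logits = forward(X)
    # Softmax cross-entropy on phoneme-state labels: the ONLY supervision.
    Z = logits - logits.max(axis=1, keepdims=True)
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
    loss = -np.log(P[np.arange(N), y]).mean()
    dlogits = P.copy()
    dlogits[np.arange(N), y] -= 1.0
    dlogits /= N
    gW_c = E.T @ dlogits                 # classifier-block gradient
    dE = dlogits @ W_c.T
    dM = dE * X                          # through the element-wise product
    gW_m = X.T @ (dM * M * (1.0 - M))    # through the sigmoid mask
    return loss, gW_m, gW_c

lr = 0.1
l0, _, _ = loss_and_grads(X, y)
for _ in range(300):
    loss, gW_m, gW_c = loss_and_grads(X, y)
    W_m -= lr * gW_m                     # mask block learns from labels alone
    W_c -= lr * gW_c
l1, _, _ = loss_and_grads(X, y)
print(l0, l1)                            # loss drops without any pre-training
```

The key point mirrored here is that the gradient for the mask estimator (`gW_m`) flows entirely through the classification loss, so no clean/noisy speech pairs are needed to train the enhancement block.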
Pages: 426-430
Number of pages: 5