Comparing Front-End Enhancement Techniques and Multiconditioned Training for Robust Automatic Speech Recognition

Cited by: 1

Authors:

Soni, Meet H. [1]

Joshi, Sonal [1]

Panda, Ashish [1]

Affiliations:

[1] TCS Innovat Labs, Mumbai, Maharashtra, India

Source:

TEXT, SPEECH, AND DIALOGUE (TSD 2019) | 2019 / Vol. 11697

Keywords:

Speech recognition; Noise robustness; Front-end processing; Multiconditioned training; FEATURES; NOISE

DOI:

10.1007/978-3-030-27947-9_28

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We present a comparison of various front-end enhancement techniques and multiconditioned training for robust Automatic Speech Recognition (ASR) under additive noise. We compare De-noising Autoencoders (DAEs) based on Deep Neural Network (DNN) and Time-Delay Neural Network (TDNN) architectures with a DNN-based Time-Frequency (T-F) masking front-end. We train these front-ends and evaluate their performance on various seen and unseen noise conditions. In multiconditioned training, we train the acoustic model on various noise conditions, along with Noise Aware Training (NAT), and test on seen and unseen noises. The results suggest that all front-ends improve performance in seen noise conditions while degrading performance in unseen noise conditions. TDNN-DAE provides the largest improvement in seen conditions and the largest degradation in unseen conditions. We use a method to improve the performance of TDNN-DAE in unseen conditions by training it on features enhanced using Vector Taylor Series with Acoustic Masking (VTS-AM) and Spectral Subtraction (SS). We show that these enhancement techniques significantly improve the efficacy of the TDNN-DAE in unseen noise conditions. Overall, we observe that multiconditioned training still gives better performance in both seen and unseen noise conditions, although the enhanced TDNN-DAE comes closest among all the front-ends to the performance of multiconditioned training.
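
The abstract names Spectral Subtraction (SS) as one of the enhancement steps applied before training the TDNN-DAE, but this record gives no implementation details. As a rough illustration only, a minimal NumPy sketch of generic power spectral subtraction with over-subtraction and a spectral floor might look like the following. The function name, the parameters `n_noise_frames`, `alpha`, and `beta`, and the assumption that the first few frames contain only noise are hypothetical choices for this sketch, not the authors' configuration.

```python
import numpy as np


def spectral_subtraction(noisy_power, n_noise_frames=10, alpha=2.0, beta=0.01):
    """Generic power spectral subtraction (illustrative sketch).

    noisy_power:    array of shape (frames, bins), a power spectrogram.
    n_noise_frames: number of leading frames ASSUMED to be noise only.
    alpha:          over-subtraction factor (hypothetical default).
    beta:           spectral floor factor (hypothetical default).
    """
    noisy_power = np.asarray(noisy_power, dtype=float)

    # Estimate the noise power per frequency bin from the leading frames.
    noise_est = noisy_power[:n_noise_frames].mean(axis=0)

    # Subtract the scaled noise estimate from every frame.
    cleaned = noisy_power - alpha * noise_est

    # Clamp to a small fraction of the noise estimate so that no bin
    # becomes negative or drops to exactly zero.
    return np.maximum(cleaned, beta * noise_est)
```

In a full pipeline the enhanced power spectrogram would then be converted to acoustic features; that step is omitted here because the record does not describe it.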

Pages: 329-340

Page count: 12
