Label-Driven Time-Frequency Masking for Robust Speech Command Recognition

Cited by: 1

Authors:

Soni, Meet [1]

Sheikh, Imran [1]

Kopparapu, Sunil Kumar [1]

Affiliations:

[1] TCS Res & Innovat, Mumbai, Maharashtra, India

Source:

TEXT, SPEECH, AND DIALOGUE (TSD 2019) | 2019 / Vol. 11697

Keywords:

Robust speech recognition; Time-frequency masking; Label driven masking; Features

DOI:

10.1007/978-3-030-27947-9_29

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Robust Automatic Speech Recognition (ASR) systems driven by speech enhancement typically require a parallel corpus of noisy and clean speech utterances for training. Moreover, many studies have reported that such front-ends, though they improve speech quality, do not translate into improved recognition performance. Multi-condition training of ASR systems, on the other hand, offers little visualization or interpretability of how these systems achieve robustness. In this paper, we propose a novel neural architecture with a unified enhancement and sequence classification block that is trained end-to-end using only noisy speech, without any knowledge of clean speech. The enhancement block is a fully convolutional network designed to perform a Time-Frequency (T-F) masking-like operation, and it is followed by an LSTM sequence classification block. The T-F masking formulation enables visualization of the learned mask and helps us analyse the T-F points that are important for classifying a speech command. Experiments performed on the Google Speech Command dataset show that the proposed network achieves better results than the model without an enhancement front-end.
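The T-F masking-like operation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no layer sizes, so the single 3x3 convolution, the sigmoid squashing, the random weights, and the names `conv2d_same`, `tf_mask`, and `apply_mask` are all assumptions made for illustration. In the paper the mask is learned end-to-end from the command labels and feeds an LSTM classifier, which is omitted here.

```python
import math
import random


def conv2d_same(x, kernel, bias=0.0):
    """2-D cross-correlation with zero padding; output has the same shape as x.

    Assumes an odd-sized kernel (e.g. 3x3) so that padding is symmetric.
    """
    n_t, n_f = len(x), len(x[0])
    k_t, k_f = len(kernel), len(kernel[0])
    pad_t, pad_f = k_t // 2, k_f // 2
    out = [[0.0] * n_f for _ in range(n_t)]
    for t in range(n_t):
        for f in range(n_f):
            acc = bias
            for i in range(k_t):
                for j in range(k_f):
                    tt, ff = t + i - pad_t, f + j - pad_f
                    if 0 <= tt < n_t and 0 <= ff < n_f:
                        acc += kernel[i][j] * x[tt][ff]
            out[t][f] = acc
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def tf_mask(spec, kernel, bias=0.0):
    """Map a (frames x bins) spectrogram to a mask with values in (0, 1)."""
    logits = conv2d_same(spec, kernel, bias)
    return [[sigmoid(v) for v in row] for row in logits]


def apply_mask(spec, mask):
    """Element-wise product: each T-F point is scaled by its mask value."""
    return [[s * m for s, m in zip(s_row, m_row)]
            for s_row, m_row in zip(spec, mask)]


# Hypothetical input: 5 frames x 8 frequency bins of non-negative magnitudes.
rng = random.Random(0)
spec = [[rng.random() for _ in range(8)] for _ in range(5)]
# Random weights stand in for parameters that would be learned from labels.
kernel = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(3)]

mask = tf_mask(spec, kernel)
enhanced = apply_mask(spec, mask)  # this would feed the sequence classifier
```

Because the mask has the same shape as the input and lies between 0 and 1, it can be plotted directly over the spectrogram, which is the kind of inspection of important T-F points that the abstract refers to.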


Pages: 341-351

Page count: 11

