Label-Driven Time-Frequency Masking for Robust Speech Command Recognition

Cited by: 1
Authors
Soni, Meet [1]
Sheikh, Imran [1]
Kopparapu, Sunil Kumar [1]
Affiliations
[1] TCS Res & Innovat, Mumbai, Maharashtra, India
Source
TEXT, SPEECH, AND DIALOGUE (TSD 2019) | 2019 / Vol. 11697
Keywords
Robust speech recognition; Time-frequency masking; Label driven masking; FEATURES;
DOI
10.1007/978-3-030-27947-9_29
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech-enhancement-driven robust Automatic Speech Recognition (ASR) systems typically require a parallel corpus of noisy and clean speech utterances for training. Moreover, many studies have reported that such front-ends, though they improve speech quality, do not translate into improved recognition performance. On the other hand, multi-condition training of ASR systems offers little visualization or interpretability of how these systems achieve robustness. In this paper, we propose a novel neural architecture with a unified enhancement and sequence classification block that is trained end-to-end using only noisy speech, without any knowledge of clean speech. The enhancement block is a fully convolutional network designed to perform a Time-Frequency (T-F) masking-like operation, and it is followed by an LSTM sequence classification block. The T-F masking formulation enables visualization of the learned mask and helps us analyse the T-F points that are important for classifying a speech command. Experiments on the Google Speech Commands dataset show that the proposed network achieves better results than the model without an enhancement front-end.
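The T-F masking-like operation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sigmoid squashing, the element-wise product, and the names `sigmoid`, `apply_tf_mask`, `noisy`, and `scores` are all assumptions made for illustration, and the `scores` values stand in for whatever the fully convolutional enhancement block would output.

```python
import math


def sigmoid(x):
    """Squash a real-valued score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_tf_mask(features, mask_scores):
    """Element-wise T-F masking (illustrative assumption, not the paper's code).

    features:    list of frames, each a list of spectral values
    mask_scores: same shape; stand-in for the enhancement network's output
    """
    if len(features) != len(mask_scores):
        raise ValueError("features and mask_scores need the same number of frames")
    masked = []
    for frame, scores in zip(features, mask_scores):
        if len(frame) != len(scores):
            raise ValueError("each frame needs one score per frequency bin")
        # A mask value near 1 keeps the T-F point, a value near 0 suppresses it.
        masked.append([sigmoid(s) * v for v, s in zip(frame, scores)])
    return masked


# Toy 2-frame, 2-bin example with hand-picked scores.
noisy = [[1.0, 2.0], [4.0, 0.5]]
scores = [[0.0, 10.0], [-10.0, 0.0]]
enhanced = apply_tf_mask(noisy, scores)
```

In a full model the masked features would then be passed to the LSTM sequence classifier, and the mask itself could be plotted to see which T-F points the network treats as important for a given command.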
Pages: 341-351
Number of pages: 11