Block-online multi-channel speech enhancement using deep neural network-supported relative transfer function estimates

Cited by: 11
Authors
Malek, Jiri [1]
Koldovsky, Zbynek [1]
Bohac, Marek [1]
Affiliations
[1] Tech Univ Liberec, Fac Mechatron Informat & Interdisciplinary Studies, Studentska 2, Liberec, Czech Republic
Keywords
array signal processing; speech recognition; speech enhancement; neural nets; transfer functions; block length; block-online multichannel speech enhancement; deep neural network-supported relative transfer function estimates; block-online processing; short utterances; voice assistant scenarios; deep neural network-based voice activity detection; relative transfer functions; highly dynamic environments; processed block; processing regime; batch processing; perceptual evaluation; speech quality; baseline automatic speech recognition system; enhancement method; time 250.0 ms; VOICE ACTIVITY DETECTION; NOISE; RECOGNITION; ENVIRONMENT; MICROPHONE; MASK
DOI
10.1049/iet-spr.2019.0304
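Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification codes
0808 ; 0809 ;
Abstract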
This work addresses the problem of block-online processing for multi-channel speech enhancement. Such processing is vital in scenarios with moving speakers and/or when short utterances are processed, e.g. in voice assistant applications. We consider several variants of a system that performs beamforming supported by deep neural network-based voice activity detection, followed by post-filtering. The speaker is targeted by estimating relative transfer functions between microphones. Each block of the input signals is processed independently, which makes the method applicable in highly dynamic environments. Because the processed blocks are short, the statistics required by the beamformer are estimated less precisely. The influence of this inaccuracy is studied and compared with the batch processing regime, in which each recording is treated as one block. The experimental evaluation is performed on the large CHiME-4 dataset and on another dataset featuring a moving target speaker. The experiments are evaluated in terms of objective and perceptual criteria. Moreover, the word error rate (WER) of a speech recognition system, for which the method serves as a front-end, is evaluated. The results indicate that the proposed method is robust to short processed blocks. Significant improvements in terms of the criteria and WER are observed even for a block length of 250 ms.
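The block-wise scheme described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the record does not say how the relative transfer function (RTF) is estimated or which beamformer is used, so the sketch assumes covariance subtraction for the RTF and an MVDR beamformer, omits the post-filter, and takes the voice activity labels as given rather than produced by a neural network. All function names and parameters (`split_blocks`, `estimate_rtf`, `mvdr_weights`, `ref`, `eps`) are hypothetical.

```python
import numpy as np

def split_blocks(n_frames, block_len):
    """Return (start, stop) frame indices of consecutive, non-overlapping blocks."""
    return [(s, min(s + block_len, n_frames)) for s in range(0, n_frames, block_len)]

def estimate_rtf(X, vad, ref=0, eps=1e-6):
    """Estimate the RTF for one frequency bin of one block (assumed method).

    X   : (mics, frames) complex STFT coefficients
    vad : (frames,) boolean array, True where speech is active
    Returns the RTF normalised to microphone `ref` and the noise covariance.
    """
    n_mics = X.shape[0]

    def cov(Z):
        # Sample covariance; an empty selection yields a zero matrix.
        return Z @ Z.conj().T / max(Z.shape[1], 1)

    # Noise statistics from speech-free frames, with diagonal loading.
    R_noise = cov(X[:, ~vad]) + eps * np.eye(n_mics)
    # Covariance subtraction: speech-active covariance minus noise covariance.
    R_speech = cov(X[:, vad]) - R_noise
    # Principal eigenvector (eigh sorts eigenvalues in ascending order).
    _, vecs = np.linalg.eigh(R_speech)
    v = vecs[:, -1]
    return v / v[ref], R_noise

def mvdr_weights(rtf, R_noise):
    """MVDR weights w = R^-1 h / (h^H R^-1 h); the output is y = w^H x."""
    num = np.linalg.solve(R_noise, rtf)
    return num / (rtf.conj() @ num)

def enhance_bin(X, vad, block_len):
    """Process each block independently, using only that block's statistics."""
    y = np.zeros(X.shape[1], dtype=complex)
    for start, stop in split_blocks(X.shape[1], block_len):
        rtf, R_noise = estimate_rtf(X[:, start:stop], vad[start:stop])
        w = mvdr_weights(rtf, R_noise)
        y[start:stop] = w.conj() @ X[:, start:stop]
    return y
```

With fewer frames per block, the two sample covariances are averaged over fewer observations, which is the loss of precision the abstract refers to. The number of frames that corresponds to a 250 ms block depends on the STFT hop size, which this record does not specify.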
Pages: 124-133
Number of pages: 10