Modeling Speech Structure to Improve T-F Masks for Speech Enhancement and Recognition

Cited by: 3

Authors:

Bu, Suliang [1]

Zhao, Yunxin [1]

Zhao, Tuo [1]

Wang, Shaojun [2]

Han, Mei [2]

Affiliations:

[1] Univ Missouri, Dept Elect Engn & Comp Sci, Spoken Language & Informat Proc Lab, Columbia, MO 65211 USA

[2] PAII Inc, Palo Alto, CA USA

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2022 / Vol. 30

Keywords:

Speech enhancement; Noise measurement; Speech recognition; Artificial neural networks; Training; Estimation; Spectrogram; Time-frequency masks; beamforming; speech enhancement and recognition; speech region; UNet++; SEPARATION; BINARY; INTELLIGIBILITY; LIKELIHOOD; CHALLENGE; NOISE; TIME

DOI:

10.1109/TASLP.2022.3196168

Chinese Library Classification:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Time-frequency (TF) masks are widely used in speech enhancement (SE). However, accurately estimating TF masks from noisy speech remains a challenge for both statistical and neural network (NN) approaches. Statistical model-based mask estimation usually depends on a good parameter initialization, while NN-based methods rely on setting proper and stable learning targets. To address these issues, we propose to extract TF speech structure from clean speech and partition the noisy speech spectrogram into mutually exclusive regions. We investigate modeling clean speech with utterance-specific narrowband complex Gaussian mixture models to derive the regions, and using the region targets to supervise the training of UNet++, a high-performance NN, to predict regions from noisy speech. For multichannel SE, we consider two scenarios for using speech regions: 1) integrating the regions with TF masks by constraining the mask values or the model parameter updates, and 2) using the predicted regions in place of TF masks. For single-channel SE, we consider using the region targets to improve TF mask targets. Furthermore, we propose to use UNet++ for TF mask estimation. Our experimental results on speech recognition (CHiME-3) and SE (CHiME-3 and LibriSpeech) demonstrate the effectiveness of modeling speech region structure to improve TF masks for speech recognition and enhancement.


Pages: 2705 - 2715

Number of pages: 11

References: 61 in total

