Real-time Multi-channel Speech Enhancement Based on Neural Network Masking with Attention Model

被引:4
作者
Xue, Cheng [1 ]
Huang, Weilong [1 ]
Chen, Weiguang [1 ]
Feng, Jinwei [2 ]
机构
[1] Alibaba Grp, Speech Lab, Hangzhou, Peoples R China
[2] Alibaba Grp, Speech Lab, Sunnyvale, CA USA
来源
INTERSPEECH 2021 | 2021年
关键词
real-time; multi-channel speech enhancement; beamforming; deep neural network;
D O I
10.21437/Interspeech.2021-2266
中图分类号
R36 [病理学]; R76 [耳鼻咽喉科学];
学科分类号
100104 ; 100213 ;
摘要
In this paper, we propose a real-time multi-channel speech enhancement method for noise reduction and dereverberation in far-field environments. The proposed method consists of two components: differential beamforming and mask estimation network. The differential beamforming is employed to suppress the interference signals from non-target directions such that a relatively clean speech can be obtained. The mask estimation network with an attention model is developed to capture the signal correlation among different channels in the feature extraction stage and enhance the feature representation that needs to be reconstructed into the target speech in the estimation mask stage. In the inference phase, the spectrum after differential beamforming is filtered by the estimated mask to obtain the final output. The spectrum after differential beamforming can provide a higher signal-to-noise ratio (SNR) than the original spectrum, so the estimated mask can more easily filter out the noise. We conducted experiments on the ConferencingSpeech2021 challenge (INTERSPEECH 2021) dataset to evaluate the proposed method. With only 2.9M parameters, the proposed method achieved competitive performance.
引用
收藏
页码:1862 / 1866
页数:5
相关论文
共 27 条
[1]  
Benesty J., 2012, Study and design of differential microphone arrays, V6, DOI DOI 10.1007/978-3-642-33753-6
[2]  
Bu H, 2017, 2017 20TH CONFERENCE OF THE ORIENTAL CHAPTER OF THE INTERNATIONAL COORDINATING COMMITTEE ON SPEECH DATABASES AND SPEECH I/O SYSTEMS AND ASSESSMENT (O-COCOSDA), P58, DOI 10.1109/ICSDA.2017.8384449
[3]  
Chakrabarty S, 2018, INT WORKSH ACOUSTIC, P476, DOI 10.1109/IWAENC.2018.8521346
[4]   Improved MVDR beamforming using single-channel mask prediction networks [J].
Erdogan, Hakan ;
Hershey, John ;
Watanabe, Shinji ;
Mandel, Michael ;
Le Roux, Jonathan .
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, :1981-1985
[5]  
Erdogan H, 2015, INT CONF ACOUST SPEE, P708, DOI 10.1109/ICASSP.2015.7178061
[6]   End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks [J].
Fu, Szu-Wei ;
Wang, Tao-Wei ;
Tsao, Yu ;
Lu, Xugang ;
Kawai, Hisashi .
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2018, 26 (09) :1570-1584
[7]  
Gemmeke JF, 2017, INT CONF ACOUST SPEE, P776, DOI 10.1109/ICASSP.2017.7952261
[8]  
Hao X., 2020, ARXIV201015508
[9]  
Heymann J, 2016, INT CONF ACOUST SPEE, P196, DOI 10.1109/ICASSP.2016.7471664
[10]   DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement [J].
Hu, Yanxin ;
Liu, Yun ;
Lv, Shubo ;
Xing, Mengtao ;
Zhang, Shimin ;
Fu, Yihui ;
Wu, Jian ;
Zhang, Bihong ;
Xie, Lei .
INTERSPEECH 2020, 2020, :2472-2476