FULLSUBNET: A FULL-BAND AND SUB-BAND FUSION MODEL FOR REAL-TIME SINGLE-CHANNEL SPEECH ENHANCEMENT

Cited by: 152
Authors
Hao, Xiang [1 ,2 ,3 ]
Su, Xiangdong [3 ]
Horaud, Radu [4 ]
Li, Xiaofei [1 ,2 ]
Affiliations
[1] Westlake Univ, Hangzhou, Peoples R China
[2] Westlake Inst Adv Study, Hangzhou, Peoples R China
[3] Inner Mongolia Univ, Coll Comp Sci, Hohhot, Peoples R China
[4] Inria Grenoble Rhone Alpes, Montbonnot St Martin, France
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
FullSubNet; Full-band and Sub-band Fusion; Sub-band; Speech Enhancement;
DOI
10.1109/ICASSP39728.2021.9414177
CLC number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper proposes a full-band and sub-band fusion model, named FullSubNet, for single-channel real-time speech enhancement. Here, full-band and sub-band refer to models that take full-band and sub-band noisy spectral features as input and predict full-band and sub-band clean-speech targets, respectively. The sub-band model processes each frequency independently: its input consists of one frequency and several context frequencies, and its output is the prediction of the clean-speech target for that frequency. These two types of models have distinct characteristics. The full-band model can capture the global spectral context and long-distance cross-band dependencies, but it lacks the ability to model signal stationarity and attend to local spectral patterns. The sub-band model is just the opposite. In the proposed FullSubNet, we connect a pure full-band model and a pure sub-band model sequentially and use practical joint training to integrate the advantages of both. We conducted experiments on the DNS Challenge (INTERSPEECH 2020) dataset to evaluate the proposed method. The results show that full-band and sub-band information are complementary and that FullSubNet can effectively integrate them. Moreover, FullSubNet also outperforms the top-ranked methods in the DNS Challenge (INTERSPEECH 2020).
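The sub-band processing described above can be illustrated with a short sketch. This is not the authors' code; the function name `subband_units`, the context width of 15 neighbors per side, and the reflect padding at the spectrum edges are all assumptions made for illustration. The sketch shows how each frequency bin of a magnitude spectrogram is grouped with its neighboring context bins so that the sub-band model can process every frequency independently over time.

```python
import numpy as np

def subband_units(spec: np.ndarray, n_neighbors: int = 15) -> np.ndarray:
    """Build sub-band units from a (F, T) magnitude spectrogram.

    For each of the F frequency bins, gather the bin itself plus
    n_neighbors context bins on each side, yielding an array of shape
    (F, 2*n_neighbors + 1, T). Edge bins are reflect-padded; the exact
    padding scheme here is an assumption for illustration.
    """
    padded = np.pad(spec, ((n_neighbors, n_neighbors), (0, 0)), mode="reflect")
    F, T = spec.shape
    width = 2 * n_neighbors + 1
    # Slide a window of `width` frequencies over every center bin.
    units = np.stack([padded[f : f + width] for f in range(F)], axis=0)
    return units  # shape (F, width, T)

# Toy usage: 257 STFT bins, 100 time frames.
spec = np.random.rand(257, 100).astype(np.float32)
units = subband_units(spec, n_neighbors=15)
print(units.shape)  # (257, 31, 100)
```

Each of the resulting 257 units is a (31, T) slice that a sub-band sequence model can consume independently, which is what lets this style of model focus on local spectral patterns and signal stationarity.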
Pages: 6633-6637
Number of pages: 5