DPT-FSNET: DUAL-PATH TRANSFORMER BASED FULL-BAND AND SUB-BAND FUSION NETWORK FOR SPEECH ENHANCEMENT

Cited by: 74
Authors
Dang, Feng [1 ,2 ,3 ]
Chen, Hangting [1 ]
Zhang, Pengyuan [1 ]
Affiliations
[1] Chinese Acad Sci, Key Lab Speech Acoust & Content Understanding, Inst Acoust, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[3] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
speech enhancement; frequency domain; dual-path transformer; full-band and sub-band fusion;
DOI
10.1109/ICASSP43922.2022.9746171
CLC classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Sub-band models have achieved promising results due to their ability to model local patterns in the spectrogram. Some studies further improve performance by fusing sub-band and full-band information. However, the structure of full-band and sub-band fusion models has not been fully explored. This paper proposes a dual-path transformer-based full-band and sub-band fusion network (DPT-FSNet) for speech enhancement in the frequency domain. The intra and inter parts of the dual-path transformer model sub-band and full-band information, respectively. The features utilized by the proposed method are more interpretable than those of the time-domain dual-path transformer. We conducted experiments on the Voice Bank + DEMAND and Interspeech 2020 Deep Noise Suppression (DNS) datasets to evaluate the proposed method. Experimental results show that the proposed method outperforms the current state of the art.
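The intra/inter alternation described in the abstract can be sketched in a toy form: the intra path attends along time within each frequency bin (sub-band modeling), and the inter path attends along frequency within each frame (full-band modeling). This is an illustrative NumPy sketch only, not the authors' implementation; the actual DPT-FSNet uses learned multi-head transformer layers, and the axis assignments here are an assumption inferred from the abstract.

```python
import numpy as np

def self_attention(x):
    # x: (seq, d). Single-head scaled dot-product self-attention with
    # identity Q/K/V projections (illustrative; no learned weights).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def dual_path_block(spec):
    # spec: (F, T, d) feature map over F frequency bins and T time frames.
    F, T, _ = spec.shape
    # Intra path: per frequency bin, attend along time (sub-band modeling).
    intra = np.stack([self_attention(spec[f]) for f in range(F)], axis=0)
    spec = spec + intra  # residual connection
    # Inter path: per time frame, attend along frequency (full-band modeling).
    inter = np.stack([self_attention(spec[:, t]) for t in range(T)], axis=1)
    return spec + inter  # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 4))  # 16 freq bins, 8 frames, 4 channels
y = dual_path_block(x)
print(y.shape)  # (16, 8, 4): shape is preserved, so blocks can be stacked
```

Because each block preserves the input shape, several such dual-path blocks can be stacked, alternating sub-band and full-band modeling, which is the core idea behind dual-path architectures.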
Pages: 6857-6861 (5 pages)