A Fused Speech Enhancement Framework for Robust Speaker Verification

Cited by: 3
Authors
Wu, Yanfeng [1,2]
Li, Taihao [2]
Zhao, Junan [1]
Wang, Qirui [1]
Xu, Jing [1]
Affiliations
[1] Nankai Univ, Coll Artificial Intelligence, Tianjin 300350, Peoples R China
[2] Zhejiang Lab, Dept Artificial Intelligence, Hangzhou 311121, Zhejiang, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Robust speaker verification; speech enhancement; multi-scale feature extraction; attention mechanism
DOI
10.1109/LSP.2023.3290832
Chinese Library Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Codes
0808; 0809
Abstract
Robust speaker verification (RSV) under noisy conditions remains a challenging task. Recently, several task-specific speech enhancement (SE) approaches have been proposed and achieve excellent performance on RSV. However, each of these works adopts only one kind of SE network and thus cannot remove noise from different aspects, limiting RSV performance. In this letter, we propose a fused SE framework (FSEF) for RSV that integrates both T-F masking-based and feature mapping-based SE networks to collect complementary information and improve robustness against noise. Two FSEF-RSV systems are constructed based on two fusion methods: score fusion and feature fusion. In addition, we present a Multi-Scale Attentive Context Aggregation Network (MSACAN) as the backbone structure of the FSEF. The MSACAN not only extracts and fuses multi-scale features adaptively but also enhances speaker characteristics against noise and interfering speakers. Experiments conducted on the noise-simulated VoxCeleb1 dataset demonstrate that both the FSEF and the MSACAN improve RSV performance compared to previous approaches.
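The abstract distinguishes two ways the FSEF combines its masking-based and mapping-based SE branches: score fusion and feature fusion. The following is a minimal PyTorch sketch of that distinction only; the stand-in modules (MaskingSE, MappingSE, SpeakerEmbedder), the equal-weight score average, and the channel-wise concatenation are illustrative assumptions, not the letter's actual architecture or released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskingSE(nn.Module):
    """Stand-in for a T-F masking-based SE branch: predicts a [0,1] mask (assumed)."""
    def __init__(self, n_bins: int = 257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_bins, n_bins), nn.Sigmoid())

    def forward(self, spec):              # spec: (batch, frames, freq bins)
        return spec * self.net(spec)      # apply predicted mask to the noisy input

class MappingSE(nn.Module):
    """Stand-in for a feature mapping-based SE branch: regresses enhanced features (assumed)."""
    def __init__(self, n_bins: int = 257):
        super().__init__()
        self.net = nn.Linear(n_bins, n_bins)

    def forward(self, spec):
        return self.net(spec)             # directly map noisy features to enhanced ones

class SpeakerEmbedder(nn.Module):
    """Stand-in speaker network: mean-pool over frames, project to an embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 192):
        super().__init__()
        self.proj = nn.Linear(in_dim, emb_dim)

    def forward(self, feats):             # feats: (batch, frames, dims)
        return F.normalize(self.proj(feats.mean(dim=1)), dim=-1)

def score_fusion(enroll, test, mask_se, map_se, emb_mask, emb_map):
    """Score each enhanced view separately, then average the cosine scores
    (equal weights assumed here)."""
    s1 = F.cosine_similarity(emb_mask(mask_se(enroll)), emb_mask(mask_se(test)))
    s2 = F.cosine_similarity(emb_map(map_se(enroll)), emb_map(map_se(test)))
    return 0.5 * (s1 + s2)

def feature_fusion(enroll, test, mask_se, map_se, emb_cat):
    """Concatenate both enhanced feature streams, then embed and score once."""
    def fuse(x):
        return torch.cat([mask_se(x), map_se(x)], dim=-1)
    return F.cosine_similarity(emb_cat(fuse(enroll)), emb_cat(fuse(test)))

if __name__ == "__main__":
    n_bins = 257
    enroll = torch.randn(4, 100, n_bins)  # (batch, frames, freq bins)
    test = torch.randn(4, 100, n_bins)
    mask_se, map_se = MaskingSE(n_bins), MappingSE(n_bins)
    print(score_fusion(enroll, test, mask_se, map_se,
                       SpeakerEmbedder(n_bins), SpeakerEmbedder(n_bins)))
    print(feature_fusion(enroll, test, mask_se, map_se,
                         SpeakerEmbedder(2 * n_bins)))

In this sketch, score fusion keeps two independent verification pipelines and combines only their trial scores, while feature fusion merges the two enhanced feature streams before a single embedding network, which is one plausible reading of the two FSEF-RSV systems the abstract names.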
Pages: 883-887
Page count: 5