A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification

Cited by: 0
Authors
Xing, Xujiang [1]
Xu, Mingxing [2]
Zheng, Thomas Fang [2]
Affiliations
[1] Xinjiang Univ, Sch Comp Sci & Technol, Urumqi, Peoples R China
[2] Tsinghua Univ, Beijing, Peoples R China
Source
INTERSPEECH 2024 | 2024
Keywords
speaker verification; noise-robust; multi-task; adversarial training; speech enhancement; recognition
DOI
10.21437/Interspeech.2024-700
CLC number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Automatic Speaker Verification (ASV) suffers from performance degradation in noisy conditions. To address this issue, we propose a novel adversarial learning framework that incorporates noise disentanglement to establish a noise-independent, speaker-invariant embedding space. Specifically, the disentanglement module consists of two encoders that separate speaker-related and speaker-irrelevant information, respectively. A reconstruction module serves as a regularization term that constrains the noise. A feature-robust loss further supervises the speaker encoder to learn noise-independent speaker embeddings without losing speaker information. In addition, adversarial training is introduced to discourage the speaker encoder from encoding acoustic-condition information, yielding a speaker-invariant embedding space. Experiments on VoxCeleb1 show that the proposed method improves speaker verification performance under both clean and noisy conditions.
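The abstract describes four interacting objectives: disentanglement with two encoders, reconstruction as a regularizer, a feature-robust loss tying noisy speaker embeddings to their clean counterparts, and adversarial training against acoustic-condition information. The Python sketch below is one illustrative reading of how such a joint loss could be assembled; the module names, the gradient-reversal form of the adversarial branch, the MSE form of the reconstruction and feature-robust terms, and the loss weights are all assumptions, not the authors' implementation.

# Hypothetical training-step sketch of the joint objective described in the abstract.
# Module names, loss forms, and weights are illustrative assumptions.
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Gradient reversal: identity in the forward pass, negated gradient in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def joint_loss(speaker_enc, noise_enc, decoder, spk_classifier, cond_discriminator,
               noisy_feat, clean_feat, spk_label, cond_label,
               w_rec=1.0, w_fr=1.0, w_adv=0.1):
    # Disentanglement: separate encoders for speaker-related and speaker-irrelevant codes.
    spk_emb_noisy = speaker_enc(noisy_feat)
    spk_emb_clean = speaker_enc(clean_feat)
    noise_emb = noise_enc(noisy_feat)

    # Reconstruction regularizer: both codes together should recover the noisy input.
    recon = decoder(torch.cat([spk_emb_noisy, noise_emb], dim=-1))
    loss_rec = F.mse_loss(recon, noisy_feat)

    # Feature-robust loss: pull the noisy speaker embedding toward its clean counterpart.
    loss_fr = F.mse_loss(spk_emb_noisy, spk_emb_clean)

    # Speaker classification keeps speaker information in the embedding.
    loss_spk = F.cross_entropy(spk_classifier(spk_emb_noisy), spk_label)

    # Adversarial branch: the gradient reversal layer makes the condition classifier's
    # gradient push the speaker encoder away from encoding acoustic-condition cues.
    cond_logits = cond_discriminator(GradReverse.apply(spk_emb_noisy, 1.0))
    loss_adv = F.cross_entropy(cond_logits, cond_label)

    return loss_spk + w_rec * loss_rec + w_fr * loss_fr + w_adv * loss_adv

In an actual implementation, the decoder design, the choice of condition labels (for example noise type or SNR band), and the relative loss weights would follow the paper's experimental setup rather than the placeholder values shown here.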
Pages: 707-711
Number of pages: 5