MULTI-VIEW SPEAKER EMBEDDING LEARNING FOR ENHANCED STABILITY AND DISCRIMINABILITY

Cited by: 1
Authors
He, Liang [1 ,2 ,3 ]
Fang, Zhihua [1 ,2 ]
Chen, Zuoer [3 ]
Xu, Minqiang [4 ]
Men, Ying [1 ,2 ]
Wang, Penghao [3 ]
Affiliations
[1] Xinjiang Univ, Sch Comp Sci & Technol, Urumqi, Peoples R China
[2] Xinjiang Key Lab Signal Detect & Proc, Urumqi, Peoples R China
[3] Tsinghua Univ, Dept Elect Engn, Beijing, Peoples R China
[4] iFly Digital Technol, Hefei, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024) | 2024
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
Speaker embedding; speaker verification; speaker diarization; deep clustering;
DOI
10.1109/ICASSP48485.2024.10448494
CLC Classification Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Deep neural network models based on the x-vector framework have become the most popular approach to speaker recognition, and the quality of the speaker features (embeddings) is critical for open-set tasks such as speaker verification and speaker diarization. The most widely used loss functions are currently based on a margin penalty; however, they only consider enlarging the inter-class distance and neglect reducing intra-class feature differences. We therefore propose a multi-view learning approach that divides training into two views at the speaker-embedding level. The classification view focuses on discriminating between different speakers, while the clustering view focuses on shrinking the feature boundaries of each speaker so that intra-class differences become smaller. The combined effect of the two views yields large inter-class distances and small intra-class distances, producing more discriminative and stable speaker embeddings. We evaluate the method on both speaker verification and speaker diarization tasks, and the results demonstrate its effectiveness.
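The abstract does not give implementation details, so the following PyTorch sketch only illustrates the general idea it describes: a margin-penalty classification view (here an ArcFace/AAM-softmax-style loss) combined with a clustering-style compactness term that pulls embeddings toward their within-batch speaker centroid. The class names, dimensions, and the 0.1 weighting are illustrative assumptions, not the authors' method.

```python
# Minimal sketch (assumed, not the paper's implementation): combine a
# margin-based classification loss with an intra-class compactness term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax (classification view)."""
    def __init__(self, embed_dim, num_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        nn.init.xavier_normal_(self.weight)
        self.margin, self.scale = margin, scale

    def forward(self, emb, labels):
        # Cosine similarity between normalized embeddings and class weights.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin only to the target-class logit.
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(self.scale * logits, labels)

def compactness_loss(emb, labels):
    """Clustering view (illustrative): pull each embedding toward the
    mean embedding of its own speaker within the mini-batch."""
    emb = F.normalize(emb)
    loss, count = emb.new_zeros(()), 0
    for spk in labels.unique():
        members = emb[labels == spk]
        if members.size(0) > 1:
            center = members.mean(dim=0, keepdim=True)
            loss = loss + (members - center).pow(2).sum(dim=1).mean()
            count += 1
    return loss / max(count, 1)

# Usage: total loss as a weighted sum of the two views (weight assumed).
if __name__ == "__main__":
    torch.manual_seed(0)
    emb = torch.randn(16, 192)           # batch of 192-dim embeddings
    labels = torch.randint(0, 8, (16,))  # 8 hypothetical speakers
    cls_loss = AAMSoftmax(192, 8)(emb, labels)
    clu_loss = compactness_loss(emb, labels)
    total = cls_loss + 0.1 * clu_loss
    print(float(cls_loss), float(clu_loss), float(total))
```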
Pages: 10081-10085
Number of pages: 5