Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Cited by: 20
Authors
Chen, Sanyuan [1]
Wu, Yu [2]
Wang, Chengyi [2]
Liu, Shujie [2]
Chen, Zhuo [2]
Wang, Peidong [2]
Liu, Gang [2]
Li, Jinyu [2]
Wu, Jian [2]
Yu, Xiangzhan [1]
Wei, Furu [2]
Affiliations
[1] Harbin Institute of Technology, Harbin, People's Republic of China
[2] Microsoft Corporation, Redmond, WA 98052, USA
Source
INTERSPEECH 2022 | 2022
Keywords
Self-Supervised Learning; Speaker Recognition; Speaker Verification
DOI
10.21437/Interspeech.2022-10019
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even when the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of self-supervised learning on speaker-related tasks, e.g., speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to the SV task comes from a combination of the mask speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition performance.
Pages: 3699-3703
Page count: 5