Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Cited by: 20

Authors:

Chen, Sanyuan [1]

Wu, Yu [2]

Wang, Chengyi [2]

Liu, Shujie [2]

Chen, Zhuo [2]

Wang, Peidong [2]

Liu, Gang [2]

Li, Jinyu [2]

Wu, Jian [2]

Yu, Xiangzhan [1]

Wei, Furu [2]

Affiliations:

[1] Harbin Institute of Technology, Harbin, People's Republic of China

[2] Microsoft Corporation, Redmond, WA 98052, USA

Source:

INTERSPEECH 2022 | 2022

Keywords:

Self-Supervised Learning; Speaker Recognition; Speaker Verification

DOI:

10.21437/Interspeech.2022-10019

Chinese Library Classification number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even when the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of self-supervised learning on speaker-related tasks, e.g., speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the VoxCeleb1 dataset suggest that the benefit of SSL to the SV task comes from a combination of the masked speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition performance.

Pages: 3699-3703

Page count: 5
