Domain-Invariant Speaker Vector Projection by Model-Agnostic Meta-Learning

Cited by: 5
Authors
Kang, Jiawen [1]
Liu, Ruiqi [1, 2]
Li, Lantian [1]
Cai, Yunqi [1, 3]
Wang, Dong [1]
Zheng, Thomas Fang [1]
Affiliations
[1] Tsinghua Univ, Ctr Speech & Language Technol, Beijing, Peoples R China
[2] China Univ Min & Technol, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
Source
INTERSPEECH 2020 | 2020
Funding
National Natural Science Foundation of China
Keywords
speaker recognition; meta-learning; domain generalization; RECOGNITION
DOI
10.21437/Interspeech.2020-2562
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
Domain generalization remains a critical problem for speaker recognition, even with state-of-the-art architectures based on deep neural networks. For example, a model trained on read speech may largely fail when applied to singing or movie scenarios. In this paper, we propose a domain-invariant projection to improve the generalizability of speaker vectors. The projection is a simple neural network trained following the Model-Agnostic Meta-Learning (MAML) principle: the objective is to classify speakers well in one domain after the projection has been updated with speech data from another domain. We tested the proposed method on CNCeleb, a new dataset consisting of single-speaker multi-condition (SSMC) data. The results demonstrate that the MAML-based domain-invariant projection produces more generalizable speaker vectors and effectively improves performance in unseen domains.
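The training objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a first-order approximation of MAML, a single linear softmax layer standing in for the projection and speaker classifier, and synthetic 2-D "speaker vectors" whose domains differ only by an additive shift. All function names, learning rates, and data values are hypothetical choices made for illustration.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def loss_and_grad(W, batch):
    """Mean cross-entropy of a linear softmax classifier and its gradient w.r.t. W."""
    grad = [[0.0] * len(W[0]) for _ in W]
    loss = 0.0
    for x, y in batch:
        p = softmax(logits(W, x))
        loss -= math.log(p[y] + 1e-12)
        for k, row in enumerate(grad):
            err = p[k] - (1.0 if k == y else 0.0)
            for j, xj in enumerate(x):
                row[j] += err * xj
    n = len(batch)
    return loss / n, [[g / n for g in row] for row in grad]

def sgd(W, grad, lr):
    return [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W, grad)]

def fomaml_step(W, meta_train, meta_test, inner_lr=0.1, outer_lr=0.1):
    """One first-order MAML step (an assumed simplification of full MAML)."""
    # Inner step: update the parameters on the meta-train domain.
    _, g_train = loss_and_grad(W, meta_train)
    W_adapted = sgd(W, g_train, inner_lr)
    # Outer step: measure speaker classification on a different domain using
    # the updated parameters, then apply that gradient to the original ones.
    test_loss, g_test = loss_and_grad(W_adapted, meta_test)
    return sgd(W, g_test, outer_lr), test_loss

def make_domain(shift, rng, n=20):
    """Synthetic domain: two speakers, vectors offset by a domain-specific shift."""
    centers = {0: (2.0, 0.0), 1: (-2.0, 0.0)}
    return [([cx + shift[0] + rng.gauss(0, 0.3),
              cy + shift[1] + rng.gauss(0, 0.3),
              1.0], spk)  # trailing 1.0 acts as a bias feature
            for spk, (cx, cy) in centers.items() for _ in range(n)]

def accuracy(W, data):
    hits = 0
    for x, y in data:
        z = logits(W, x)
        hits += int(z.index(max(z)) == y)
    return hits / len(data)

rng = random.Random(0)
seen = [make_domain(s, rng) for s in [(0.0, 1.0), (0.5, -1.0), (-0.5, 0.5)]]
unseen = make_domain((0.25, -0.5), rng)  # never used during training

W = [[0.0] * 3 for _ in range(2)]
losses = []
for _ in range(200):
    # Each episode draws two distinct seen domains: one to update on, one to test on.
    meta_train, meta_test = rng.sample(seen, 2)
    W, test_loss = fomaml_step(W, meta_train, meta_test)
    losses.append(test_loss)

print(round(losses[0], 3), round(losses[-1], 3), accuracy(W, unseen))
```

Full MAML would also differentiate through the inner update, which requires second-order gradients; the first-order variant is used here only to keep the sketch dependency-free.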
Pages: 3825-3829
Page count: 5