Multimodal Association for Speaker Verification

Cited by: 8
Authors
Shon, Suwon [1]
Glass, James [2]
Affiliations
[1] ASAPP Inc, New York, NY 10007 USA
[2] MIT Comp Sci & Artificial Intelligence Lab, Cambridge, MA USA
Source
INTERSPEECH 2020 | 2020
Keywords
speaker verification; fine-tuning; multimodal; SRE18
DOI
10.21437/Interspeech.2020-1996
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
In this paper, we propose multimodal association for fine-tuning a speaker verification system using both voice and face. Inspired by neuroscientific findings, the proposed approach mimics the way a unimodal perception system benefits from the multisensory association of stimulus pairs. To verify this, we use the SRE18 evaluation protocol for experiments and use out-of-domain data, VoxCeleb, for the proposed multimodal fine-tuning. Although the proposed approach relies on voice-face paired multimodal data during the training phase, the face is no longer needed once training is done, and only speech audio is used by the speaker verification system. In the experiments, we observed that the unimodal model, i.e. the speaker verification model, benefits from the multimodal association of voice and face and generalizes better than before by learning a channel-invariant speaker representation.
Pages: 2247-2251
Page count: 5
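To make the idea in the abstract concrete, below is a minimal sketch (not the authors' released code) of multimodal-association fine-tuning: a speaker encoder is trained with the usual speaker-classification loss plus an auxiliary term that pulls each utterance embedding toward the precomputed face embedding of the same identity, so the face branch shapes the voice embedding space but is discarded at test time. The module names, dimensions, the cosine-based association loss, and the 0.1 loss weight are all illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of multimodal-association fine-tuning for speaker verification.
# Assumptions: a stand-in speaker encoder, 512-dim embeddings, paired
# face embeddings from a frozen pretrained face recognizer, and a
# cosine-distance association loss (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Stand-in for a pretrained x-vector/ResNet speaker encoder."""
    def __init__(self, feat_dim=40, emb_dim=512):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(512, emb_dim)

    def forward(self, feats):          # feats: (batch, feat_dim, frames)
        h = self.frame_layers(feats)   # (batch, 512, frames)
        h = h.mean(dim=2)              # temporal average pooling
        return self.embedding(h)       # (batch, emb_dim)

def association_loss(voice_emb, face_emb):
    """Pull paired voice/face embeddings together in a shared space."""
    return (1 - F.cosine_similarity(voice_emb, face_emb, dim=1)).mean()

speaker_net = SpeakerEncoder()
classifier = nn.Linear(512, 5994)      # e.g. VoxCeleb2 speaker count
optimizer = torch.optim.Adam(
    list(speaker_net.parameters()) + list(classifier.parameters()), lr=1e-4)

# One fine-tuning step on dummy data; face embeddings are only used
# to shape the speaker embedding space, never needed at inference.
feats = torch.randn(8, 40, 300)        # batch of acoustic features
face_emb = torch.randn(8, 512)         # precomputed paired face embeddings
labels = torch.randint(0, 5994, (8,))

voice_emb = speaker_net(feats)
loss = F.cross_entropy(classifier(voice_emb), labels) \
     + 0.1 * association_loss(voice_emb, face_emb)  # assumed weight
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

At verification time only `speaker_net` is kept: trial utterances are embedded with it and scored (e.g. by cosine similarity or PLDA), exactly as in a unimodal system.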