Deep Speaker Recognition: Modular or Monolithic?

Cited by: 26
Authors
Bhattacharya, Gautam [1, 2]
Alam, Jahangir [2]
Kenny, Patrick [2]
Affiliations
[1] McGill Univ, Montreal, PQ, Canada
[2] Comp Res Inst Montreal, Montreal, PQ, Canada
Source
INTERSPEECH 2019 | 2019
Keywords
deep speaker recognition; end-to-end; large margin loss
DOI
10.21437/Interspeech.2019-3146
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification codes
100104; 100213;
Abstract
Speaker recognition has made extraordinary progress with the advent of deep neural networks. In this work, we analyze the performance of end-to-end deep speaker recognizers on two popular text-independent tasks: NIST-SRE 2016 and VoxCeleb. Through a combination of a deep convolutional feature extractor, self-attentive pooling and large-margin loss functions, we achieve state-of-the-art performance on VoxCeleb. Our best individual and ensemble models show relative improvements of 70% and 82%, respectively, over the best reported results on this task. On the challenging NIST-SRE 2016 task, our proposed end-to-end models show good performance but are unable to match a strong i-vector baseline. State-of-the-art systems for this task use a modular framework that combines neural network embeddings with a probabilistic linear discriminant analysis (PLDA) classifier. Drawing inspiration from this approach, we propose to replace the PLDA classifier with a neural network. Our modular neural network approach is able to outperform the i-vector baseline using cosine distance to score verification trials.
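
As a rough illustration of the cosine scoring mentioned in the abstract, the sketch below scores a single verification trial by comparing an enrollment embedding with a test embedding. It is not the authors' implementation: the function names `cosine_score` and `accept_trial`, the toy three-dimensional vectors, and the threshold of 0.5 are all assumptions made for this example. It computes cosine similarity (higher means more similar); cosine distance is simply one minus this value.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between two speaker embeddings (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(enroll, test))
    norm_enroll = math.sqrt(sum(x * x for x in enroll))
    norm_test = math.sqrt(sum(y * y for y in test))
    if norm_enroll == 0.0 or norm_test == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_enroll * norm_test)


def accept_trial(enroll, test, threshold=0.5):
    """Accept the trial as 'same speaker' if the score reaches the threshold.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out trials.
    """
    return cosine_score(enroll, test) >= threshold


# Toy embeddings (hypothetical values, not real network outputs).
enrollment = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]
orthogonal = [0.0, 0.0, 3.0]

print(accept_trial(enrollment, same_direction))
print(accept_trial(enrollment, orthogonal))
```

Because cosine similarity ignores vector length, only the direction of each embedding affects the score, which is one reason it can serve as a simple back end once the embeddings themselves are discriminative.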
Pages: 1143-1147
Number of pages: 5
Related papers
27 in total
[1]  
[Anonymous], 2019, UTTERANCE LEVEL AGGR
[2]  
Bhattacharya G., 2019, AC SPEECH SIGN PROC
[3]  
Bhattacharya G., 2019, GENERATIVE ADVERSARI
[4]   Deep Speaker Embeddings for Short-Duration Speaker Verification [J].
Bhattacharya, Gautam ;
Alam, Jahangir ;
Kenny, Patrick .
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, :1517-1521
[5]  
Bhattacharya G, 2016, IEEE W SP LANG TECH, P192, DOI 10.1109/SLT.2016.7846264
[6]  
Cai W., 2018, P OD SPEAK LANG REC, P74
[7]  
Chung Joon Son, 2018, P INTERSPEECH, DOI 10.21437/INTERSPEECH.2018-1929
[8]  
Deng J., 2018, ARXIV180107698
[9]  
Garcia-Romero D., 2014, P OD SPEAK LANG REC, V8
[10]   Generative Adversarial Networks [J].
Goodfellow, Ian ;
Pouget-Abadie, Jean ;
Mirza, Mehdi ;
Xu, Bing ;
Warde-Farley, David ;
Ozair, Sherjil ;
Courville, Aaron ;
Bengio, Yoshua .
COMMUNICATIONS OF THE ACM, 2020, 63 (11) :139-144