Deep Speaker Recognition: Modular or Monolithic?

Cited by: 26

Authors:

Bhattacharya, Gautam [1, 2]

Alam, Jahangir [2]

Kenny, Patrick [2]

Affiliations:

[1] McGill Univ, Montreal, PQ, Canada

[2] Comp Res Inst Montreal, Montreal, PQ, Canada

Source:

INTERSPEECH 2019 | 2019

Keywords:

deep speaker recognition; end-to-end; large margin loss;

DOI:

10.21437/Interspeech.2019-3146

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Subject classification code:

100104 ; 100213 ;

Abstract:

Speaker recognition has made extraordinary progress with the advent of deep neural networks. In this work, we analyze the performance of end-to-end deep speaker recognizers on two popular text-independent tasks: NIST-SRE 2016 and VoxCeleb. Through a combination of a deep convolutional feature extractor, self-attentive pooling and large-margin loss functions, we achieve state-of-the-art performance on VoxCeleb. Our best individual and ensemble models show relative improvements of 70% and 82%, respectively, over the best reported results on this task. On the challenging NIST-SRE 2016 task, our proposed end-to-end models show good performance but are unable to match a strong i-vector baseline. State-of-the-art systems for this task use a modular framework that combines neural network embeddings with a probabilistic linear discriminant analysis (PLDA) classifier. Drawing inspiration from this approach, we propose to replace the PLDA classifier with a neural network. Our modular neural network approach is able to outperform the i-vector baseline using cosine distance to score verification trials.
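
The abstract's final sentence refers to scoring verification trials by cosine distance. As a rough illustration only, the sketch below scores a trial by the cosine similarity of an enrollment embedding and a test embedding, then compares the score to a threshold. The names `cosine_score` and `accept`, the threshold value, and the toy 3-dimensional embeddings are assumptions made for this example; none of them come from the paper.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity of two embeddings; higher suggests the same speaker."""
    dot = sum(a * b for a, b in zip(enroll, test))
    norm_enroll = math.sqrt(sum(a * a for a in enroll))
    norm_test = math.sqrt(sum(b * b for b in test))
    return dot / (norm_enroll * norm_test)


def accept(enroll, test, threshold=0.5):
    """Accept the trial when the score reaches a (hypothetical) threshold."""
    return cosine_score(enroll, test) >= threshold


# Toy embeddings (illustrative values, not real speaker embeddings).
same_direction = cosine_score([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Because the score depends only on the angle between the two vectors, scaling an embedding does not change the decision. In practice the threshold would be tuned on held-out trials.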


Pages: 1143-1147

Page count: 5
