X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION

Cited by: 0
Authors
Snyder, David [1]
Garcia-Romero, Daniel
Sell, Gregory
Povey, Daniel
Khudanpur, Sanjeev
Affiliation
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Source
2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2018
Funding
National Science Foundation (USA);
Keywords
speaker recognition; deep neural networks; data augmentation; x-vectors; NOISE;
DOI
Not available
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
In this paper, we use data augmentation to improve performance of deep neural network (DNN) embeddings for speaker recognition. The DNN, which is trained to discriminate between speakers, maps variable-length utterances to fixed-dimensional embeddings that we call x-vectors. Prior studies have found that embeddings leverage large-scale training datasets better than i-vectors. However, it can be challenging to collect substantial quantities of labeled data for training. We use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness. The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese. We find that while augmentation is beneficial in the PLDA classifier, it is not helpful in the i-vector extractor. However, the x-vector DNN effectively exploits data augmentation, due to its supervised training. As a result, the x-vectors achieve superior performance on the evaluation datasets.
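The augmentation named in the abstract (added noise and reverberation) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `add_noise` and `add_reverb`, the SNR value, the synthetic signals, and the toy impulse response are all assumptions made for illustration, and the noise signal is assumed to be non-silent.

```python
import numpy as np


def add_noise(speech, noise, snr_db):
    """Mix noise into speech at a target signal-to-noise ratio in dB (illustrative helper)."""
    # Repeat or trim the noise so that it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)  # assumed non-zero
    # Choose the scale so that speech_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def add_reverb(speech, rir):
    """Convolve speech with a room impulse response and keep the original length (illustrative helper)."""
    rir = rir / np.sqrt(np.sum(rir ** 2))  # normalise the impulse response to unit energy
    return np.convolve(speech, rir)[: len(speech)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)        # stand-in for one second of speech at 16 kHz
    babble = rng.standard_normal(4000)        # stand-in for a noise recording
    rir = np.exp(-np.arange(800) / 100.0)     # toy exponentially decaying impulse response
    # Each clean utterance yields extra training copies that share its speaker label.
    augmented = [add_noise(clean, babble, snr_db=10.0), add_reverb(clean, rir)]
    print([a.shape for a in augmented])
```

Because each augmented copy keeps the speaker label of the clean utterance, the amount of labeled training data grows without new data collection, which is the point the abstract makes about augmentation being inexpensive.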
Pages: 5329-5333
Page count: 5