Deep Speaker Embeddings for Short-Duration Speaker Verification

Cited by: 105
Authors
Bhattacharya, Gautam [1, 2]
Alam, Jahangir [2]
Kenny, Patrick [2]
Affiliations
[1] McGill Univ, Montreal, PQ, Canada
[2] Comp Res Inst Montreal, Montreal, PQ, Canada
Source
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017
Keywords
speaker recognition; convolutional neural networks; deep learning; i-vectors
DOI
10.21437/Interspeech.2017-1575
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The performance of a state-of-the-art speaker verification system is severely degraded when it is presented with trial recordings of short duration. In this work we propose to use deep neural networks to learn short-duration speaker embeddings. We focus on the 5s-5s condition, wherein both sides of a verification trial are 5 seconds long. In our previous work we established that learning a non-linear mapping from i-vectors to speaker labels is beneficial for speaker verification [1]. In this work we take the idea of learning a speaker classifier one step further: we apply deep neural networks directly to time-frequency speech representations. We propose two feedforward network architectures for this task. Our best model is based on a deep convolutional architecture wherein recordings are treated as images. From our experimental findings we advocate treating utterances as images or 'speaker snapshots', much like in face recognition. Our convolutional speaker embeddings perform significantly better than i-vectors when scoring is done using cosine distance, where the relative improvement is 23.5%. The proposed deep embeddings combined with cosine distance also outperform a state-of-the-art i-vector verification system by 1%, providing further empirical evidence in favor of our learned speaker features.
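The cosine scoring mentioned in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `cosine_score` and `verify`, the toy 4-dimensional vectors, and the threshold of 0.5 are all assumptions made for the example.

```python
import math


def cosine_score(enroll, test):
    """Return the cosine similarity between two equal-length embeddings."""
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(enroll, test))
    norm_enroll = math.sqrt(sum(a * a for a in enroll))
    norm_test = math.sqrt(sum(b * b for b in test))
    if norm_enroll == 0.0 or norm_test == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_enroll * norm_test)


def verify(enroll, test, threshold=0.5):
    """Accept the trial when the score reaches the threshold.

    The threshold is a hypothetical value; in practice it would be
    tuned on held-out trials.
    """
    return cosine_score(enroll, test) >= threshold


# Toy embeddings (assumed values, for illustration only).
same = verify([0.9, 0.1, 0.3, 0.2], [0.8, 0.2, 0.4, 0.1])
diff = verify([0.9, 0.1, 0.3, 0.2], [-0.7, 0.6, -0.2, 0.1])
```

In a real system the two vectors would be the embeddings extracted from the enrollment and test sides of a trial, which in the 5s-5s condition are each 5 seconds long.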
Pages: 1517-1521
Number of pages: 5
Related papers
16 items in total
[1]  
[Anonymous], SPOK LANG TECHN WORK
[2]  
[Anonymous], P BR MACH VIS
[3]  
[Anonymous], DEEP NEURAL NETWORK
[4]  
[Anonymous], 2015, INT C LEARNING REPRE
[5]   Front-End Factor Analysis for Speaker Verification [J].
Dehak, Najim ;
Kenny, Patrick J. ;
Dehak, Reda ;
Dumouchel, Pierre ;
Ouellet, Pierre .
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2011, 19 (04) :788-798
[6]  
Heigold G, 2016, INT CONF ACOUST SPEE, P5115, DOI 10.1109/ICASSP.2016.7472652
[7]  
Ioffe S., 2015, ARXIV150203167, P448, DOI 10.48550/ARXIV.1502.03167
[8]  
Kenny P., 2010, OD 2010 SPEAK LANG R, P14
[9]   A study of interspeaker variability in speaker verification [J].
Kenny, Patrick ;
Ouellet, Pierre ;
Dehak, Najim ;
Gupta, Vishwa ;
Dumouchel, Pierre .
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2008, 16 (05) :980-988
[10]  
Prince S., 2007, COMPUTER VISION 2007, P1