DNN i-vector Speaker Verification with Short, Text-constrained Test Utterances

Cited by: 23

Authors:

Zhong, Jinghua [1]

Hu, Wenping [2]

Soong, Frank [2]

Meng, Helen [1]

Affiliations:

[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Peoples R China

[2] Microsoft Res Asia, Speech Grp, Beijing, Peoples R China

Source:

18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017

Keywords:

DNN i-vector; DNN adaptation; senone; frame alignment; RECOGNITION; FEATURES

DOI:

10.21437/Interspeech.2017-1036

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

We investigate how to improve the performance of DNN i-vector based speaker verification for short, text-constrained test utterances, e.g., connected digit strings. Because of its smaller, limited vocabulary, text-constrained verification can deliver better performance than text-independent verification on short utterances. We study the problem using a "phonetically aware" deep neural network (DNN) and its capability for "stochastic phonetic alignment" in constructing supervectors and estimating the corresponding i-vectors, with two speech databases: a large-vocabulary, conversational, speaker-independent database (Fisher) and a small-vocabulary, continuous-digit database (RSR2015 Part III). The phonetic alignment efficiency and the resulting speaker verification performance are compared across differently sized senone sets that can characterize the phonetic pronunciations of utterances in the two databases. On the RSR2015 Part III evaluation, using only digit-related senones yields a relative EER improvement of 7.89% for male speakers and 3.54% for female speakers. DNN bottleneck features were also studied for their capability to extract phonetically sensitive information, which is useful for text-independent or text-constrained speaker verification. We found that tandem features, formed by combining MFCCs with bottleneck features, can further reduce EERs.
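The quantities named in the abstract (EER, relative EER improvement, and tandem MFCC plus bottleneck features) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the threshold-sweep EER estimate, and all sample values are assumptions chosen for illustration only, and the example numbers are not results from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping thresholds over the observed scores.

    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest (a simple approximation).
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False reject: a target trial scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False accept: a non-target trial scoring at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def relative_eer_reduction(baseline_eer, new_eer):
    """Relative EER improvement in percent, with respect to the baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


def tandem(mfcc_frames, bottleneck_frames):
    """Concatenate per-frame MFCC and bottleneck vectors (tandem features)."""
    if len(mfcc_frames) != len(bottleneck_frames):
        raise ValueError("both feature streams must have the same frame count")
    return [list(m) + list(b) for m, b in zip(mfcc_frames, bottleneck_frames)]


# Hypothetical scores and features, for illustration only.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
reduction = relative_eer_reduction(4.0, 3.0)
frames = tandem([[1.0, 2.0]], [[0.5]])
```

In this sketch a relative improvement is measured against the baseline EER, so a drop from a hypothetical 4.0% to 3.0% counts as a 25% relative reduction, not a 1% one.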

Pages: 1507-1511

Number of pages: 5
