DNN i-vector Speaker Verification with Short, Text-constrained Test Utterances

Cited by: 23
Authors
Zhong, Jinghua [1]
Hu, Wenping [2]
Soong, Frank [2]
Meng, Helen [1]
Affiliations
[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Peoples R China
[2] Microsoft Res Asia, Speech Grp, Beijing, Peoples R China
Source
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017
Keywords
DNN i-vector; DNN adaptation; senone; frame alignment; RECOGNITION; FEATURES;
DOI
10.21437/Interspeech.2017-1036
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We investigate how to improve the performance of DNN i-vector based speaker verification for short, text-constrained test utterances, e.g., connected digit strings. Because of its smaller, limited vocabulary, text-constrained verification can deliver better performance than text-independent verification on short utterances. We study the problem using a "phonetically aware" Deep Neural Net (DNN) and its capability for "stochastic phonetic alignment" in constructing supervectors and estimating the corresponding i-vectors, with two speech databases: a large-vocabulary, conversational, speaker-independent database (Fisher) and a small-vocabulary, continuous-digit database (RSR2015 Part III). The phonetic alignment efficiency and the resulting speaker verification performance are compared across differently sized senone sets that characterize the phonetic pronunciations of utterances in the two databases. Evaluation on RSR2015 Part III shows relative EER improvements of 7.89% for male speakers and 3.54% for female speakers when only digit-related senones are used. DNN bottleneck features were also studied for their capability to extract phonetically sensitive information that is useful for text-independent or text-constrained speaker verification. We found that by combining MFCC with bottleneck features in tandem, EERs can be further reduced.
Pages: 1507-1511
Number of pages: 5