Evaluation of Speaker Verification Security and Detection of HMM-Based Synthetic Speech

Cited by: 176
Authors
De Leon, Phillip L. [1]
Pucher, Michael [2]
Yamagishi, Junichi [3]
Hernaez, Inma [4]
Saratxaga, Ibon [4]
Affiliations
[1] New Mexico State Univ, Klipsch Sch Elect & Comp Engn, Las Cruces, NM 88003 USA
[2] Telecommun Res Ctr Vienna FTW, A-1220 Vienna, Austria
[3] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland
[4] Univ Basque Country, Bilbao 48013, Spain
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2012, Vol. 20, No. 8
Funding
Austrian Science Fund; UK Engineering and Physical Sciences Research Council
Keywords
Security; speaker recognition; speech synthesis; NORMALIZATION; ALGORITHMS; IMPOSTOR; SYSTEM;
DOI
10.1109/TASL.2012.2201472
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model-universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.
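The abstract's two headline figures, a low equal error rate (EER) and a high acceptance rate for synthetic speech, are both read off verification scores at a decision threshold. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. The function names and the toy score lists are hypothetical and chosen only for illustration; they are not data from the paper.

```python
# Illustrative sketch only: all scores below are made-up toy values,
# not results from the paper.

def error_rates(genuine, impostor, threshold):
    """Return (false accept rate, false reject rate) at a score threshold.

    A claim is accepted when its score is >= threshold.
    """
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate the EER by trying every observed score as the threshold.

    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def acceptance_rate(scores, threshold):
    """Fraction of claims whose score clears the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical scores: true-speaker trials, human impostor trials, and
# synthetic-speech trials matched to the claimed speaker.
genuine = [2.1, 1.8, 2.5, 1.9, 2.3]
impostor = [-1.0, -0.5, 0.2, -1.3, 0.1]
synthetic = [1.7, 2.0, 0.0, 1.9, 2.2]

eer, thr = equal_error_rate(genuine, impostor)
synthetic_accept = acceptance_rate(synthetic, thr)
```

In this toy setup the threshold separates genuine from human-impostor scores cleanly, yet most of the synthetic scores still clear it. That is the kind of gap between a low EER and a high synthetic-speech acceptance rate that the abstract describes.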
Pages: 2280-2290
Number of pages: 11