Countermeasures for Automatic Speaker Verification Replay Spoofing Attack: On Data Augmentation, Feature Representation, Classification and Fusion

Cited by: 49
Authors
Cai, Weicheng [1]
Cai, Danwei [1,2]
Liu, Wenbo [1]
Li, Gang [3]
Li, Ming [1,2]
Affiliations
[1] Sun Yat Sen Univ, SYSU CMU Joint Inst Engn, Sch Elect & Informat Technol, Guangzhou, Guangdong, Peoples R China
[2] SYSU CMU Shunde Int Joint Res Inst, Guangzhou, Guangdong, Peoples R China
[3] Jiangsu Jinling Sci & Technol Grp Ltd, Nanjing, Jiangsu, Peoples R China
Source
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017
Funding
National Natural Science Foundation of China;
Keywords
ASVspoof; replay attack; data augmentation; end-to-end; representation learning; ResNet; MACHINES;
DOI
10.21437/Interspeech.2017-906
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
The ongoing ASVspoof 2017 challenge aims to detect replay attacks for text-dependent speaker verification. In this paper, we propose multiple replay spoofing countermeasure systems, some of which boost the CQCC-GMM baseline system after score-level fusion. We investigate different steps in the system-building pipeline, including data augmentation, feature representation, classification and fusion. First, in order to augment the training data and simulate unseen replay conditions, we converted the raw genuine training data into replay spoofing data with a parametric sound reverberator and a phase shifter. Second, we employed the original spectrogram rather than CQCC as input to explore end-to-end feature representation learning methods. The spectrogram is randomly cropped into fixed-size segments and then fed into a deep residual network (ResNet). Third, upon the CQCC features, we replaced the subsequent GMM classifier with deep neural networks, including a fully-connected deep neural network (FDNN) and a bidirectional long short-term memory network (BLSTM). Experiments showed that the data augmentation strategy can significantly improve system performance. The final fused system achieves a 16.39% EER on the test set of ASVspoof 2017 for the common task.
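As an illustrative sketch (not from the paper itself), the fixed-size spectrogram cropping for the ResNet input and the score-level fusion step described above could look roughly like the NumPy code below; the 400-frame segment length, the equal fusion weights, and all function names are assumptions made for illustration only.

import numpy as np

def random_crop(spectrogram, segment_frames=400, rng=None):
    # Crop (or tile then crop) a (freq_bins, time_frames) spectrogram to a fixed
    # number of frames, so variable-length utterances fit a fixed-size ResNet input.
    rng = rng or np.random.default_rng()
    n_freq, n_time = spectrogram.shape
    if n_time < segment_frames:
        reps = int(np.ceil(segment_frames / n_time))   # repeat short utterances along time
        spectrogram = np.tile(spectrogram, (1, reps))
        n_time = spectrogram.shape[1]
    start = rng.integers(0, n_time - segment_frames + 1)
    return spectrogram[:, start:start + segment_frames]

def fuse_scores(subsystem_scores, weights=None):
    # Weighted score-level fusion: average per-utterance scores across subsystems.
    scores = np.asarray(subsystem_scores, dtype=float)  # shape (n_systems, n_utterances)
    if weights is None:
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    return np.average(scores, axis=0, weights=weights)

# Toy usage with random data.
spec = np.abs(np.random.randn(257, 623))                # e.g. 512-point STFT magnitude
segment = random_crop(spec)                             # -> shape (257, 400)
fused = fuse_scores([[0.2, -1.3, 0.8], [0.5, -0.9, 1.1]])

In practice the fusion weights would be tuned on a development set rather than set equal, and the crop length would be chosen to match the network's input size.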
Pages: 17-21
Number of pages: 5