Speaker recognition with hybrid features from a deep belief network

Cited by: 68
Authors
Ali, Hazrat [1]
Tran, Son N. [2]
Benetos, Emmanouil [2, 3]
Garcez, Artur S. d'Avila [2]
Affiliations
[1] COMSATS Inst Informat Technol, Dept Elect Engn, Univ Rd, Abbottabad 22060, Pakistan
[2] City Univ London, Dept Comp Sci, Northampton Sq, London EC1V 0HB, England
[3] Queen Mary Univ London, Sch Elect Engn & Comp Sci, London, England
Keywords
Deep belief networks; Deep learning; Mel-frequency cepstral coefficients; Automatic speech recognition; Machines
DOI
10.1007/s00521-016-2501-7
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Learning representations from audio data has shown advantages over handcrafted features such as mel-frequency cepstral coefficients (MFCCs) in many audio applications. In most representation learning approaches, connectionist systems have been used to learn and extract latent features from fixed-length data. In this paper, we propose an approach that combines the learned features and the MFCC features for the speaker recognition task and that can be applied to audio scripts of different lengths. In particular, we study the use of features from different levels of a deep belief network (DBN) for quantizing the audio data into vectors of audio word counts. These vectors represent audio scripts of different lengths, which makes it easier to train a classifier on them. Our experiments show that the audio word count vectors generated from a mixture of DBN features at different layers give better performance than the MFCC features. We can also achieve further improvement by combining the audio word count vectors and the MFCC features.
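The quantization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the codebook values, the 2-D frame features standing in for DBN hidden-layer activations, and the function names (`nearest_codeword`, `audio_word_counts`, `hybrid_features`) are all hypothetical, and the nearest-codeword rule is an assumption, since the abstract does not say how the audio words are obtained. The sketch only shows how recordings with different numbers of frames map to count vectors of the same length, which can then be concatenated with an utterance-level MFCC vector.

```python
from math import fsum


def nearest_codeword(frame, codebook):
    """Return the index of the codeword closest to `frame` (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(codebook):
        dist = fsum((a - b) ** 2 for a, b in zip(frame, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def audio_word_counts(frames, codebook):
    """Map a variable-length list of frame features to a fixed-length count vector."""
    counts = [0] * len(codebook)
    for frame in frames:
        counts[nearest_codeword(frame, codebook)] += 1
    return counts


def hybrid_features(frames, codebook, mfcc_summary):
    """Concatenate the audio word counts with an utterance-level MFCC vector."""
    return audio_word_counts(frames, codebook) + list(mfcc_summary)


# Hypothetical codebook of three "audio words" (assumed to be learned beforehand).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# Hypothetical frame features for two recordings of different lengths.
short_utterance = [[0.1, 0.0], [0.9, 1.1]]
long_utterance = [[0.1, 0.0], [0.9, 1.1], [0.0, 0.9], [1.0, 0.8], [0.2, 0.1]]

short_counts = audio_word_counts(short_utterance, codebook)
long_counts = audio_word_counts(long_utterance, codebook)
```

Both count vectors have one entry per codeword regardless of how many frames each recording contains, so a standard classifier can be trained on them directly.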
Pages: 13-19
Page count: 7