Improving the self-adaptive voice activity detector for speaker verification using MAP adaptation and asymmetric tapers

Cited by: 4
Authors
Asbai N. [1, 2]
Bengherabi M. [1]
Amrouche A. [2]
Aklouf Y. [2]
Affiliations
[1] Center for Development of Advanced Technologies (CDTA), Baba Hassen
[2] Speech Com. & Signal Proc. Lab, USTHB, Bab Ezzouar
Keywords
Asymmetric tapers; GMM–UBM; MAP adaptation; Noisy conditions; Self-adaptive VAD; VQ-VAD
DOI
10.1007/s10772-014-9260-6
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper improves the recently proposed voice activity detector based on vector quantization and speech-enhancement preprocessing (VQ-VAD), and applies it to speaker verification in noisy environments. VQ-VAD computes a likelihood ratio on an utterance-by-utterance basis, using speech and non-speech models trained on the mel-frequency cepstral coefficients (MFCCs) of each utterance. However, the distinction between speech and non-speech segments in a speech signal is independent of the speaker. A modified VQ-VAD technique is therefore proposed: two universal background models (UBMs), one for speech and one for non-speech, are trained on long, utterance-independent speech material and then adapted to each short speaker utterance via maximum a posteriori (MAP) adaptation, replacing the VQ models. The MFCCs are also extracted with the recently proposed asymmetric tapers instead of the traditional Hamming window. With GMM–UBM as the baseline speaker verification system, extensive simulations were carried out by adding noise at different levels to the clean TIMIT database, which is characterized by short training utterances and very short test utterances. The results show the superiority of the proposed GMM-MAP-VAD approach in adverse conditions. Furthermore, a drastic reduction in the equal error rate (EER) is observed when asymmetric tapers are used. © 2014, Springer Science+Business Media New York.
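The adaptation and decision steps described in the abstract can be illustrated with a minimal sketch. It assumes diagonal-covariance GMMs, mean-only MAP adaptation with a relevance factor (a common GMM–UBM recipe), and a fixed threshold on the frame-level log-likelihood ratio. The abstract does not give the paper's actual model sizes, relevance factor, or threshold, so every function name and default value below is hypothetical.

```python
import numpy as np

def frame_loglik(X, weights, means, variances):
    """Per-frame log-likelihood and component posteriors under a diagonal GMM.

    X: (T, D) feature frames; weights: (K,); means, variances: (K, D).
    """
    diff = X[:, None, :] - means[None, :, :]                     # (T, K, D)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                - 0.5 * np.sum(diff ** 2 / variances, axis=2))   # (T, K)
    # Log-sum-exp over components for numerical stability.
    peak = log_comp.max(axis=1, keepdims=True)
    loglik = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    post = np.exp(log_comp - loglik[:, None])                    # (T, K)
    return loglik, post

def map_adapt_means(X, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a UBM to the frames in X.

    relevance is an assumed value, not taken from the paper.
    """
    _, post = frame_loglik(X, weights, means, variances)
    n_k = post.sum(axis=0)                                       # soft counts (K,)
    e_k = (post.T @ X) / np.maximum(n_k, 1e-10)[:, None]         # data means (K, D)
    alpha = (n_k / (n_k + relevance))[:, None]                   # adaptation weight
    return alpha * e_k + (1.0 - alpha) * means

def llr_vad(X, speech_gmm, nonspeech_gmm, threshold=0.0):
    """Label a frame as speech when its log-likelihood ratio exceeds threshold.

    Each model is a (weights, means, variances) tuple.
    """
    ll_speech, _ = frame_loglik(X, *speech_gmm)
    ll_nonspeech, _ = frame_loglik(X, *nonspeech_gmm)
    return (ll_speech - ll_nonspeech) > threshold
```

In use, the speech UBM would be adapted with frames of the utterance that are likely speech and the non-speech UBM with frames that are likely noise, after which `llr_vad` is run over all frames with the two adapted models. How those frames are selected is not stated in the abstract and is left to the caller here.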
Pages: 195-203
Number of pages: 8