Investigation on LP-residual representations for speaker identification

Cited by: 24
Authors
Chetouani, M. [1 ]
Faundez-Zanuy, M. [2 ]
Gas, B. [1 ]
Zarader, J. L. [1 ]
Affiliations
[1] Univ Paris 06, F-75252 Paris 05, France
[2] Escola Univ Politecn Mataro, Barcelona, Spain
Keywords
Feature extraction; Speaker identification; LP-residue; Non-linear speech processing; EXTRACTION;
DOI
10.1016/j.patcog.2008.08.008
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Feature extraction is an essential step in speaker recognition systems. In this paper, we propose to improve these systems by exploiting both conventional features, such as mel-frequency cepstral coefficients (MFCC) and linear predictive cepstral coefficients (LPCC), and non-conventional ones. The method exploits information present in the linear predictive (LP) residual signal. The features extracted from the LP-residue are then combined with the MFCC or the LPCC. We investigate two approaches, termed temporal and frequential representations. The first consists of an auto-regressive (AR) modelling of the signal followed by a cepstral transformation, similar to the LPC-LPCC transformation. To take into account the non-linear nature of speech signals, we used two estimation methods based on second- and third-order statistics. They are termed, respectively, R-SOS-LPCC (residual plus second-order-statistics-based estimation of the AR model plus cepstral transformation) and R-HOS-LPCC (higher order). For the frequential approach, we exploit a filter bank method called the power difference of spectra in sub-band (PDSS), which measures the spectral flatness over the sub-bands. The resulting features are named R-PDSS. The proposed schemes are analysed on a speaker identification problem with two different databases. The first is the Gaudi database, which contains 49 speakers. Its main interest lies in the controlled acquisition conditions: mismatch between microphones and the interval between sessions. The second database is the well-known NTIMIT corpus with 630 speakers. The performance of the features is confirmed on this larger corpus. In addition, we propose to compare traditional features and residual ones through the fusion of recognizers (feature extractor + classifier). The results show that residual features carry speaker-dependent information, and that combining them with the LPCC or the MFCC yields overall improvements in robustness under different mismatches. A comparison between the residual features under the opinion fusion framework gives useful information about the potential of both temporal and frequential representations. (C) 2008 Elsevier Ltd. All rights reserved.
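The two residual representations described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the autocorrelation-based (second-order) AR estimate, the equal-width sub-bands, and the flatness formula (one minus the ratio of geometric to arithmetic mean of the sub-band power spectrum) are all assumptions made for illustration. The third-order (HOS) estimation is not shown.

```python
import numpy as np

def lpc_autocorr(frame, order):
    """Estimate predictor coefficients a[1..order] (x_hat[n] = sum a_k x[n-k])
    with the autocorrelation method and the Levinson-Durbin recursion."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)  # index 0 is unused
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        err *= 1.0 - k * k
    return a[1:]

def lp_residual(frame, a):
    """LP residual e[n] = x[n] - sum_k a_k x[n-k] (inverse filtering)."""
    frame = np.asarray(frame, dtype=float)
    pred = np.zeros_like(frame)
    for k in range(1, len(a) + 1):
        pred[k:] += a[k - 1] * frame[:-k]
    return frame - pred

def lpc_to_cepstrum(a, n_ceps):
    """Standard LPC-to-cepstrum recursion for H(z) = 1 / (1 - sum a_k z^-k)."""
    p = len(a)
    c = np.zeros(n_ceps + 1)  # index 0 is unused
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

def subband_flatness_features(signal, n_bands, eps=1e-12):
    """Assumed PDSS-like measure: 1 - geometric mean / arithmetic mean of the
    power spectrum in each of n_bands equal-width sub-bands."""
    power = np.abs(np.fft.rfft(signal)) ** 2 + eps
    bands = np.array_split(power, n_bands)
    return np.array([1.0 - np.exp(np.mean(np.log(b))) / np.mean(b) for b in bands])
```

Under these assumptions, a temporal feature vector in the spirit of R-SOS-LPCC would be obtained as `lpc_to_cepstrum(lpc_autocorr(lp_residual(x, lpc_autocorr(x, p)), p), n_ceps)`, and a frequential one in the spirit of R-PDSS as `subband_flatness_features(lp_residual(x, lpc_autocorr(x, p)), n_bands)`, where `x` is a speech frame and `p`, `n_ceps`, and `n_bands` are hypothetical parameters.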
Pages: 487-494
Number of pages: 8