Iterative Feature Normalization Scheme for Automatic Emotion Detection from Speech

Cited by: 39
Authors
Busso, Carlos [1 ]
Mariooryad, Soroosh [1 ]
Metallinou, Angeliki [2 ]
Narayanan, Shrikanth [3 ]
Affiliations
[1] Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, Richardson, TX 75080 USA
[2] Pearson Knowledge Technol, Menlo Pk, CA USA
[3] Univ So Calif, Viterbi Sch Engn, Los Angeles, CA 90089 USA
Funding
National Science Foundation (US)
Keywords
Emotion recognition; speaker normalization; emotion; feature normalization; fundamental frequency; recognition; level
DOI
10.1109/T-AFFC.2013.26
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
The externalization of emotion is intrinsically speaker dependent. A robust emotion recognition system should be able to compensate for these differences across speakers. A natural approach is to normalize the features before training the classifiers. However, the normalization scheme should not affect the acoustic differences between emotional classes. This study presents the iterative feature normalization (IFN) framework, an unsupervised front end designed specifically for emotion detection. The IFN approach aims to reduce the acoustic differences in neutral speech across speakers, while preserving the inter-emotional variability in expressive speech. This goal is achieved by iteratively detecting neutral speech for each speaker and using this subset to estimate the feature normalization parameters. An affine transformation is then applied to both the neutral and the emotional speech. This process is repeated until the results of the emotion detection system are consistent between consecutive iterations. The IFN approach is exhaustively evaluated using the IEMOCAP database and a data set recorded under free, uncontrolled conditions, with different evaluation configurations. The results show that systems trained with the IFN approach achieve better performance than systems trained either without normalization or with global normalization.
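The iterative loop described in the abstract (detect neutral speech, estimate normalization parameters from that subset, apply an affine transform to all speech, repeat until the labels stabilize) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `detect_neutral` callable is a hypothetical stand-in for the paper's emotion detection system, and the affine transform is assumed to be per-dimension z-normalization estimated from the detected neutral frames.

```python
import numpy as np

def iterative_feature_normalization(features, detect_neutral, max_iters=10):
    """Hedged sketch of the IFN loop for one speaker.

    features: (n_frames, n_dims) acoustic feature matrix.
    detect_neutral: hypothetical callable that takes the (normalized)
        features and returns a boolean mask marking neutral frames; it
        stands in for the emotion detection system used in the paper.
    """
    normalized = features.copy()
    prev_mask = None
    for _ in range(max_iters):
        # Step 1: detect the neutral subset for this speaker.
        mask = detect_neutral(normalized)
        # Step 2: estimate affine (shift/scale) parameters from the
        # detected neutral frames only.
        mu = features[mask].mean(axis=0)
        sigma = features[mask].std(axis=0) + 1e-8  # guard against zero variance
        # Step 3: apply the affine transform to ALL frames, both the
        # neutral and the emotional speech.
        normalized = (features - mu) / sigma
        # Step 4: stop once consecutive iterations agree on the labels.
        if prev_mask is not None and np.array_equal(mask, prev_mask):
            break
        prev_mask = mask
    return normalized
```

Because the parameters are estimated from neutral speech only, the transform equalizes speakers' neutral baselines without shrinking the differences between emotional classes, which is the property the abstract emphasizes.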
Pages: 386-397
Page count: 12