Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks

Cited by: 413
Authors
Mao, Qirong [1, 2]
Dong, Ming [1 ]
Huang, Zhengwei [2 ]
Zhan, Yongzhao [2 ]
Affiliations
[1] Wayne State Univ, Dept Comp Sci, Detroit, MI 48202 USA
[2] Jiangsu Univ, Dept Comp Sci & Commun Engn, Zhenjiang 212013, Jiangsu, Peoples R China
Keywords
Affective-salient discriminative feature analysis; convolutional neural networks; feature learning; speech emotion recognition
DOI
10.1109/TMM.2014.2360798
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
As an essential means of understanding human emotional behavior, speech emotion recognition (SER) has attracted a great deal of attention in human-centered signal processing. Accuracy in SER depends heavily on finding good affect-related, discriminative features. In this paper, we propose to learn affect-salient features for SER using convolutional neural networks (CNN). Training the CNN involves two stages. In the first stage, unlabeled samples are used to learn local invariant features (LIF) using a variant of the sparse auto-encoder (SAE) with reconstruction penalization. In the second stage, the LIF are used as the input to a feature extractor, salient discriminative feature analysis (SDFA), to learn affect-salient, discriminative features using a novel objective function that encourages feature saliency, orthogonality, and discrimination for SER. Our experimental results on benchmark datasets show that our approach leads to stable and robust recognition performance in complex scenes (e.g., with speaker and language variation, and environment distortion) and outperforms several well-established SER features.
Pages: 2203-2213
Page count: 11