Feature Pooling of Modulation Spectrum Features for Improved Speech Emotion Recognition in the Wild

Cited by: 28
Authors
Avila, Anderson R. [1 ]
Akhtar, Zahid [1 ]
Santos, Joao F. [1 ]
O'Shaughnessy, Douglas [1 ]
Falk, Tiago H. [1 ]
Affiliations
[1] INRS EMT, Telecommun, Montreal, PQ, Canada
Funding
Natural Sciences and Engineering Research Council of Canada; European Union Horizon 2020;
Keywords
Affective computing; speech emotion recognition; modulation spectrum; in-the-wild; NEURAL-NETWORKS; FREQUENCY;
DOI
10.1109/TAFFC.2018.2858255
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Interest in affective computing is burgeoning, in great part due to its role in emerging affective human-computer interfaces (HCI). To date, the majority of existing research on automated emotion analysis has relied on data collected in controlled environments. With the rise of HCI applications on mobile devices, however, so-called "in-the-wild" settings have posed a serious threat to emotion recognition systems, particularly those based on voice. In such settings, environmental factors such as ambient noise and reverberation severely hamper system performance. In this paper, we quantify the detrimental effects that the environment has on emotion recognition and explore the benefits achievable with speech enhancement. Moreover, we propose a modulation spectral feature pooling scheme that is shown to outperform a state-of-the-art benchmark system for environment-robust prediction of spontaneous arousal and valence emotional primitives. Experiments on an environment-corrupted version of the RECOLA dataset of spontaneous interactions show the proposed feature pooling scheme, combined with speech enhancement, outperforming the benchmark across different noise-only, reverberation-only and noise-plus-reverberation conditions. Additional tests with the SEWA database show the benefits of the proposed method for in-the-wild applications.
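The abstract names two technical ingredients: pooling statistics computed over modulation spectrum features, and prediction of continuous arousal/valence primitives, for which Lin's concordance correlation coefficient (CCC, reference [25] below) is the customary agreement metric. The sketch below is an illustrative reconstruction of these ideas, not the paper's exact pipeline; the frame length, hop size, number of retained modulation bins, and the mean/std pooling choice are all assumptions made for the example.

```python
import numpy as np

def modulation_spectrum_features(signal, frame_len=256, hop=128, n_mod_bins=8):
    """Illustrative modulation-spectrum feature extraction with pooling.

    1. Short-time magnitude spectrogram (framed, windowed FFT).
    2. Per acoustic-frequency band, FFT of the temporal envelope
       gives the modulation spectrum.
    3. Pool across acoustic frequency with simple statistics.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))      # (time, acoustic freq)

    env = spec - spec.mean(axis=0, keepdims=True)   # remove DC per band
    mod = np.abs(np.fft.rfft(env, axis=0))          # (mod freq, acoustic freq)
    mod = mod[:n_mod_bins]                          # keep low modulation freqs

    # Feature pooling: mean and std per modulation-frequency bin.
    return np.concatenate([mod.mean(axis=1), mod.std(axis=1)])

def concordance_correlation(y_true, y_pred):
    """Lin's CCC, the standard metric for continuous arousal/valence."""
    mx, my = y_true.mean(), y_pred.mean()
    vx, vy = y_true.var(), y_pred.var()
    cov = ((y_true - mx) * (y_pred - my)).mean()
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

# Usage: features for a 1-second synthetic 4 Hz amplitude-modulated tone.
fs = 8000
t = np.arange(fs) / fs
x = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 440 * t)
feats = modulation_spectrum_features(x)  # 8 mean + 8 std pooled features
```

A perfect predictor yields CCC = 1; unlike Pearson correlation, CCC also penalizes bias and scale mismatch between prediction and ground truth, which is why it is preferred for dimensional emotion regression.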
Pages: 177-188
Page count: 12
References
44 entries in total
[21]   EmoNets: Multimodal deep learning approaches for emotion recognition in video [J].
Kahou, Samira Ebrahimi ;
Bouthillier, Xavier ;
Lamblin, Pascal ;
Gulcehre, Caglar ;
Michalski, Vincent ;
Konda, Kishore ;
Jean, Sebastien ;
Froumenty, Pierre ;
Dauphin, Yann ;
Boulanger-Lewandowski, Nicolas ;
Ferrari, Raul Chandias ;
Mirza, Mehdi ;
Warde-Farley, David ;
Courville, Aaron ;
Vincent, Pascal ;
Memisevic, Roland ;
Pal, Christopher ;
Bengio, Yoshua .
JOURNAL ON MULTIMODAL USER INTERFACES, 2016, 10 (02) :99-111
[22]   ImageNet Classification with Deep Convolutional Neural Networks [J].
Krizhevsky, Alex ;
Sutskever, Ilya ;
Hinton, Geoffrey E. .
COMMUNICATIONS OF THE ACM, 2017, 60 (06) :84-90
[23]   Deep learning [J].
LeCun, Yann ;
Bengio, Yoshua ;
Hinton, Geoffrey .
NATURE, 2015, 521 (7553) :436-444
[24]   Stress and emotion classification using jitter and shimmer features [J].
Li, Xi ;
Tao, Jidong ;
Johnson, Michael T. ;
Soltis, Joseph ;
Savage, Anne ;
Leong, Kirsten M. ;
Newman, John D. .
2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, :1081-+
[25]   A CONCORDANCE CORRELATION-COEFFICIENT TO EVALUATE REPRODUCIBILITY [J].
LIN, LI .
BIOMETRICS, 1989, 45 (01) :255-268
[26]   P.563 - The ITU-T standard for single-ended speech quality assessment [J].
Malfait, Ludovic ;
Berger, Jens ;
Kastner, Martin .
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2006, 14 (06) :1924-1934
[27]  
Pascanu R., 2013, INT C MACH LEARN, P1310
[28]   Affective computing: challenges [J].
Picard, RW .
INTERNATIONAL JOURNAL OF HUMAN-COMPUTER STUDIES, 2003, 59 (1-2) :55-64
[29]   Spectral and Cepstral Audio Noise Reduction Techniques in Speech Emotion Recognition [J].
Pohjalainen, Jouni ;
Ringeval, Fabien ;
Zhang, Zixing ;
Schuller, Bjoern .
MM'16: PROCEEDINGS OF THE 2016 ACM MULTIMEDIA CONFERENCE, 2016, :670-674
[30]  
Ringeval F, 2013, IEEE INT CONF AUTOMA