Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms

Cited by: 266
Authors
Satt, Aharon [1]
Rozenberg, Shai [1,2]
Hoory, Ron [1]
Affiliations
[1] IBM Research - Haifa, Haifa, Israel
[2] Technion - Israel Institute of Technology, Haifa, Israel
Source
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017
Keywords
Speech Emotion Recognition; Para-lingual; Deep Neural Network; Spectrogram;
DOI
10.21437/Interspeech.2017-200
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
We present a new implementation of emotion recognition from the para-lingual information in speech, based on a deep neural network applied directly to spectrograms. This method achieves higher recognition accuracy than previously published results while also limiting latency: it processes the speech input in segments of up to 3 seconds and splits a longer input into non-overlapping parts to reduce the prediction latency. The deep network comprises common neural network tools - convolutional and recurrent layers - which are shown to effectively learn the information that represents emotions directly from spectrograms. A convolution-only, lower-complexity network achieves a prediction accuracy of 66% over four emotions (tested on IEMOCAP, a common evaluation corpus), while a combined convolution-LSTM, higher-complexity model achieves 68%. Using spectrograms as the speech-representing features enables effective handling of non-speech background signals such as music (excluding singing) and crowd noise, even at noise levels comparable to the speech signal level. Using harmonic modeling to remove non-speech components from the spectrogram, we demonstrate a significant improvement in emotion recognition accuracy in the presence of unknown background non-speech signals.
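The paper publishes no code, but the architecture the abstract describes maps naturally onto a short model definition. The following is a minimal sketch, assuming PyTorch; the layer sizes, class names, and hyperparameters are illustrative assumptions, not the authors' configuration. A spectrogram segment (at most 3 seconds of frames) passes through a small convolutional stack, the resulting feature maps are read as a time sequence by an LSTM, and a linear layer produces logits over the four emotion classes; a helper mirrors the non-overlapping split of longer inputs.

```python
# Hypothetical sketch of the convolution + LSTM pipeline described in the
# abstract; all sizes and names are assumptions, not the published setup.
import torch
import torch.nn as nn


class ConvLSTMEmotionNet(nn.Module):
    def __init__(self, n_freq: int = 128, n_classes: int = 4):
        super().__init__()
        # Convolutional front end: learns local time-frequency patterns
        # directly from the spectrogram (input shape: batch x 1 x freq x time).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # LSTM summarizes the time axis of the convolutional feature maps.
        self.lstm = nn.LSTM(input_size=32 * (n_freq // 4),
                            hidden_size=128, batch_first=True)
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        feats = self.conv(spec)                    # (batch, 32, freq/4, time/4)
        b, c, f, t = feats.shape
        seq = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)
        out, _ = self.lstm(seq)                    # (batch, time/4, 128)
        return self.classifier(out[:, -1])         # logits over the emotions


def split_into_segments(spec: torch.Tensor, max_frames: int):
    """Split a (freq, time) spectrogram into non-overlapping segments of at
    most max_frames frames, mirroring the latency-limiting split the
    abstract describes; per-segment predictions can then be averaged."""
    return [spec[:, i:i + max_frames]
            for i in range(0, spec.shape[1], max_frames)]
```

For the convolution-only, lower-complexity variant the abstract compares against, the LSTM would presumably be replaced by pooling over the time axis before the classifier.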
Pages: 1089-1093
Number of Pages: 5
References
25 items in total
  • [1] Amodei, Dario, 2015, arXiv:1512.02595
  • [2] [Anonymous], ICML
  • [3] [Anonymous], 2014, P 22 ACM INT C MULT
  • [4] [Anonymous], AC SPEECH SIGN PROC
  • [5] [Anonymous], 2014, APSIPA T SIGNAL INFO
  • [6] [Anonymous], 2015, MARK RES REP
  • [7] [Anonymous], 2014, INTERSPEECH
  • [8] [Anonymous], 2016, EM DET REC MARK TECH
  • [9] Boersma, Paul, 2001, Glot International, V5
  • [10] Busso, Carlos; Bulut, Murtaza; Lee, Chi-Chun; Kazemzadeh, Abe; Mower, Emily; Kim, Samuel; Chang, Jeannette N.; Lee, Sungbok; Narayanan, Shrikanth S. IEMOCAP: interactive emotional dyadic motion capture database. [J]. LANGUAGE RESOURCES AND EVALUATION, 2008, 42 (04): 335-359