Emotion Recognition from Variable-Length Speech Segments Using Deep Learning on Spectrograms

Cited by: 73
Authors
Ma, Xi [1,3]
Wu, Zhiyong [1,2,3]
Jia, Jia [1,3]
Xu, Mingxing [1,3]
Meng, Helen [1,2]
Cai, Lianhong [1,3]
Affiliations
[1] Tsinghua Univ, Grad Sch Shenzhen, Tsinghua CUHK Joint Res Ctr Media Sci Technol & S, Shenzhen 518055, Peoples R China
[2] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Shatin, Hong Kong, Peoples R China
[3] Tsinghua Univ, Dept Comp Sci & Technol, TNList, Beijing 100084, Peoples R China
Source
19th Annual Conference of the International Speech Communication Association (INTERSPEECH 2018), Vols 1-6: Speech Research for Emerging Markets in Multilingual Societies | 2018
Funding
National Natural Science Foundation of China;
Keywords
Speech Emotion Recognition; Variable-Length Speech Segments; Spectrogram; Deep Neural Network;
DOI
10.21437/Interspeech.2018-2228
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
In this work, an emotion recognition approach is proposed for variable-length speech segments by applying deep neural networks directly to spectrograms. The spectrogram carries comprehensive paralinguistic information that is useful for emotion recognition. We extract such information from spectrograms and accomplish the emotion recognition task by combining Convolutional Neural Networks (CNNs) with Recurrent Neural Networks (RNNs). To handle variable-length speech segments, we propose a specially designed neural network structure that accepts variable-length speech sentences directly as input. Compared with traditional methods that split each sentence into smaller fixed-length segments, our method avoids the accuracy degradation introduced by the segmentation process. We evaluate the emotion recognition model on the IEMOCAP dataset over four emotions. Experimental results demonstrate that the proposed method outperforms the fixed-length neural network on both weighted accuracy (WA) and unweighted accuracy (UA).
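The general idea the abstract describes (CNN feature extraction over a spectrogram, an RNN over the resulting frame sequence, and pooling over time so that the input length can vary) can be illustrated with a minimal PyTorch sketch. The layer sizes, the frequency-only pooling, the mean-over-time pooling, and all identifiers (CRNNEmotion, n_mels, etc.) are illustrative assumptions, not the authors' published architecture.

# Minimal sketch of a CNN+RNN emotion classifier over variable-length
# spectrograms (PyTorch). All layer sizes and names are illustrative
# assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class CRNNEmotion(nn.Module):
    def __init__(self, n_mels=128, n_classes=4):
        super().__init__()
        # Convolutions pool only along frequency, so the time axis
        # stays variable-length and is handled by the RNN.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),  # halve frequency, keep time
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
        )
        rnn_in = 64 * (n_mels // 4)            # channels x reduced freq bins
        self.rnn = nn.LSTM(rnn_in, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 128, n_classes)

    def forward(self, spec):
        # spec: (batch, 1, n_mels, time); time may differ between calls
        h = self.cnn(spec)                     # (batch, 64, n_mels//4, time)
        h = h.permute(0, 3, 1, 2).flatten(2)   # (batch, time, features)
        out, _ = self.rnn(h)                   # (batch, time, 256)
        utt = out.mean(dim=1)                  # pool over time -> utterance
        return self.fc(utt)                    # logits over 4 emotions

# Passing one utterance per forward call sidesteps padding/segmentation.
model = CRNNEmotion()
logits = model(torch.randn(1, 1, 128, 437))   # 437 frames, arbitrary length
print(logits.shape)                           # torch.Size([1, 4])

Because the convolutions never pool along time and the utterance-level pooling is length-agnostic, the same weights accept any number of spectrogram frames, which is the property the paper contrasts with fixed-length segmentation.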
Pages: 3683-3687
Page count: 5
References (18 in total)
[1] [Anonymous], 2015, Computer Science.
[2] [Anonymous], 2014, Computer Science.
[3] [Anonymous], 2014, Interspeech.
[4] Bhargava, M., 2015, Interspeech.
[5] Busso, Carlos; Bulut, Murtaza; Lee, Chi-Chun; Kazemzadeh, Abe; Mower, Emily; Kim, Samuel; Chang, Jeannette N.; Lee, Sungbok; Narayanan, Shrikanth S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335-359.
[6] Chernykh, Vladimir, 2017, arXiv:1701.08071.
[7] El Ayadi, Moataz; Kamel, Mohamed S.; Karray, Fakhri. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 2011, 44(3): 572-587.
[8] Huang, Zhengwei; Dong, Ming; Mao, Qirong; Zhan, Yongzhao. Speech Emotion Recognition Using CNN. Proceedings of the 2014 ACM Conference on Multimedia (MM '14), 2014: 801-804.
[9] Jaitly, N., 2011, Proceedings of ICASSP, p. 5884.
[10] Lee, Jinkyu, 2015, Interspeech.