Speech Emotion Recognition Using Speech Feature and Word Embedding

Cited by: 0
Authors
Atmaja, Bagus Tris [1 ,2 ]
Shirai, Kiyoaki [2 ]
Akagi, Masato [2 ]
Affiliations
[1] Inst Teknol Sepuluh Nopember, Surabaya, Indonesia
[2] Japan Adv Inst Sci & Technol, Nomi, Japan
Source
2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) | 2019
Keywords
DOI
Not available
Chinese Library Classification
TP31 [Computer Software]
Subject Classification Codes
081202; 0835
Abstract
Emotion recognition can be performed automatically from many modalities. This paper presents categorical speech emotion recognition using speech features and word embeddings. Text features can be combined with speech features to improve emotion recognition accuracy, and both features can be obtained from speech. Here, we use speech segments, obtained by removing silences in an utterance, from which the acoustic features are extracted for speech-based emotion recognition. Word embeddings are used as input features for text-based emotion recognition, and a combination of both feature types is proposed to improve performance. Two unidirectional LSTM layers are used for the text network, and fully connected layers are applied for acoustic emotion recognition. Both networks are then merged by fully connected layers in an early-fusion manner to produce one of four predicted emotion categories. The results show that the combination of speech and text achieves higher accuracy, i.e., 75.49%, compared to speech-only emotion recognition (58.29%) or text-only emotion recognition (68.01%). This result also outperforms previously proposed methods using the same dataset and the same modalities.
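The abstract describes the fusion architecture only at a high level. Below is a minimal sketch of such a bimodal network in Keras: two unidirectional LSTM layers over word embeddings for the text branch, fully connected layers over acoustic features for the speech branch, and early fusion by concatenation followed by dense layers into a four-class softmax. All feature dimensions, sequence lengths, and layer sizes are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a bimodal (speech + text) emotion classifier with early fusion.
# Dimensions and layer sizes below are assumptions for illustration only.
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_WORDS = 100      # assumed maximum number of word tokens per utterance
EMBED_DIM = 300      # assumed word-embedding dimension
ACOUSTIC_DIM = 34    # assumed per-utterance acoustic feature dimension
NUM_EMOTIONS = 4     # four emotion categories, as in the paper

# Text branch: word embeddings passed through two unidirectional LSTM layers.
text_in = layers.Input(shape=(MAX_WORDS, EMBED_DIM), name="word_embeddings")
x = layers.LSTM(256, return_sequences=True)(text_in)
x = layers.LSTM(256)(x)

# Acoustic branch: speech features passed through fully connected layers.
speech_in = layers.Input(shape=(ACOUSTIC_DIM,), name="acoustic_features")
y = layers.Dense(256, activation="relu")(speech_in)
y = layers.Dense(128, activation="relu")(y)

# Early fusion: concatenate both branches, then classify with dense layers.
z = layers.concatenate([x, y])
z = layers.Dense(128, activation="relu")(z)
out = layers.Dense(NUM_EMOTIONS, activation="softmax")(z)

model = Model(inputs=[text_in, speech_in], outputs=out)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

In this sketch the two modalities are trained jointly end to end; the final LSTM state of the text branch and the dense output of the acoustic branch are concatenated before classification, which is one straightforward realization of the early-fusion scheme described in the abstract.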
Pages: 519-523
Page count: 5