Towards enhancing emotion recognition via multimodal framework

Cited by: 3
Authors
Devi, C. Akalya [1 ]
Renuka, D. Karthika [1 ]
Pooventhiran, G. [2 ]
Harish, D. [3 ]
Yadav, Shweta [4 ]
Thirunarayan, Krishnaprasad [4 ]
Affiliations
[1] PSG Coll Technol, Dept Informat Technol, Coimbatore, Tamil Nadu, India
[2] Qualcomm India Private Ltd, Chennai, Tamil Nadu, India
[3] Software AG, Bangalore, Karnataka, India
[4] Wright State Univ, Dept Comp Sci & Engn, Dayton, OH 45435 USA
Keywords
Emotion recognition; time-distributed models; CNN-LSTM; BERT; DCCA;
DOI
10.3233/JIFS-220280
Chinese Library Classification
TP18 [Theory of artificial intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Emotional AI is the next era of AI, poised to play a major role in fields such as entertainment, health care, and self-paced online education by drawing on cues from multiple sources. In this work, we propose a multimodal emotion recognition system that extracts information from speech, motion-capture, and text data. The main aim of this research is to improve the unimodal architectures so that they outperform the state of the art, and then to combine them into a robust multimodal fusion architecture. We developed 1D and 2D CNN-LSTM time-distributed models for speech, a hybrid CNN-LSTM model for motion-capture data, and a BERT-based model for text data to achieve state-of-the-art results, and we explored both concatenation-based decision-level fusion and Deep CCA-based feature-level fusion schemes. The proposed speech and mocap models achieve emotion recognition accuracies of 65.08% and 67.51%, respectively, and the BERT-based text model achieves an accuracy of 72.60%. The decision-level fusion approach significantly improves emotion detection accuracy on the IEMOCAP and MELD datasets: it reaches 80.20% on IEMOCAP, which is 8.61% higher than state-of-the-art methods, and 63.52% and 61.65% in 5-class and 7-class classification on MELD, both above the state of the art.
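The abstract does not spell out the fusion classifier itself, so the following is a minimal sketch, assuming a Keras implementation, of the concatenation-based decision-level fusion it describes: each unimodal model (speech, mocap, text) is assumed to emit a probability vector over the emotion classes, and these vectors are concatenated and fed to a small dense classifier. The four-class setup, the Dense(32) hidden layer, and all layer/variable names are hypothetical illustrations, not taken from the paper.

# Illustrative sketch (not the authors' code): concatenation-based
# decision-level fusion of per-modality emotion predictions, assuming
# each unimodal model already outputs a probability vector over
# NUM_CLASSES emotions. Layer sizes and class count are hypothetical.
import numpy as np
from tensorflow.keras import layers, Model

NUM_CLASSES = 4  # e.g. a 4-class IEMOCAP setup (assumption)

# Per-modality probability vectors produced by the unimodal models
speech_probs = layers.Input(shape=(NUM_CLASSES,), name="speech_probs")
mocap_probs = layers.Input(shape=(NUM_CLASSES,), name="mocap_probs")
text_probs = layers.Input(shape=(NUM_CLASSES,), name="text_probs")

# Decision-level fusion: concatenate the per-modality decisions and
# learn a small classifier on top of the joint decision vector.
fused = layers.Concatenate()([speech_probs, mocap_probs, text_probs])
hidden = layers.Dense(32, activation="relu")(fused)
output = layers.Dense(NUM_CLASSES, activation="softmax", name="emotion")(hidden)

fusion_model = Model(inputs=[speech_probs, mocap_probs, text_probs],
                     outputs=output)
fusion_model.compile(optimizer="adam",
                     loss="categorical_crossentropy",
                     metrics=["accuracy"])

# Dummy forward pass with random probability vectors, just to show shapes.
batch = [np.random.dirichlet(np.ones(NUM_CLASSES), size=8) for _ in range(3)]
print(fusion_model.predict(batch).shape)  # (8, NUM_CLASSES)

In a real pipeline the random inputs above would be replaced by the softmax outputs of the trained speech, mocap, and text models; the DCCA-based feature-level alternative would instead project intermediate features of two modalities into a maximally correlated space before classification.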
Pages: 2455-2470
Number of pages: 16