Towards enhancing emotion recognition via multimodal framework

Cited by: 3
Authors
Devi, C. Akalya [1 ]
Renuka, D. Karthika [1 ]
Pooventhiran, G. [2 ]
Harish, D. [3 ]
Yadav, Shweta [4 ]
Thirunarayan, Krishnaprasad [4 ]
Affiliations
[1] PSG Coll Technol, Dept Informat Technol, Coimbatore, Tamil Nadu, India
[2] Qualcomm India Private Ltd, Chennai, Tamil Nadu, India
[3] Software AG, Bangalore, Karnataka, India
[4] Wright State Univ, Dept Comp Sci & Engn, Dayton, OH 45435 USA
Keywords
Emotion recognition; time-distributed models; CNN-LSTM; BERT; DCCA;
DOI
10.3233/JIFS-220280
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Emotional AI, which draws on cues from multiple sources, is poised to play a major role in fields such as entertainment, health care, and self-paced online education. In this work, we propose a multimodal emotion recognition system that extracts information from speech, motion capture, and text data. The main aim of this research is to improve the unimodal architectures so that they outperform the state of the art, and then to combine them into a robust multimodal fusion architecture. We developed 1D and 2D time-distributed CNN-LSTM models for speech, a hybrid CNN-LSTM model for motion-capture data, and a BERT-based model for text, and we evaluated both a concatenation-based decision-level fusion scheme and a Deep CCA-based feature-level fusion scheme. The proposed speech and motion-capture models achieve emotion recognition accuracies of 65.08% and 67.51%, respectively, and the BERT-based text model achieves 72.60%. The decision-level fusion approach significantly improves emotion detection accuracy on the IEMOCAP and MELD datasets: it reaches 80.20% on IEMOCAP, 8.61% higher than state-of-the-art methods, and 63.52% and 61.65% on 5-class and 7-class classification on MELD, also exceeding the state of the art.
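As a rough illustration of the architecture described above, the sketch below builds a time-distributed 2D CNN-LSTM for segmented speech spectrograms and a small concatenation-based decision-level fusion head over per-modality class probabilities. This is a minimal sketch under stated assumptions, not the authors' implementation: the input shapes, layer widths, class count, and the dense fusion classifier are illustrative choices.

```python
# Minimal sketch (illustrative only): a time-distributed 2D CNN-LSTM for speech
# and a concatenation-based decision-level fusion head. Shapes, layer widths,
# and the 4-class setting are assumptions, not the paper's exact configuration.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 4              # assumed class count (e.g., a 4-class IEMOCAP setup)
SEGMENTS, H, W = 8, 64, 64   # assumed: each spectrogram split into 8 time segments


def build_speech_cnn_lstm():
    """Time-distributed 2D CNN feature extractor followed by an LSTM."""
    inputs = keras.Input(shape=(SEGMENTS, H, W, 1))
    x = layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu"))(inputs)
    x = layers.TimeDistributed(layers.MaxPooling2D())(x)
    x = layers.TimeDistributed(layers.Conv2D(64, 3, activation="relu"))(x)
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)
    x = layers.LSTM(128)(x)  # temporal modelling across spectrogram segments
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return keras.Model(inputs, outputs, name="speech_cnn_lstm")


def build_fusion_head(n_modalities=3):
    """Concatenate per-modality class probabilities and classify the joint vector."""
    inputs = [keras.Input(shape=(NUM_CLASSES,)) for _ in range(n_modalities)]
    x = layers.Concatenate()(inputs)
    x = layers.Dense(32, activation="relu")(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return keras.Model(inputs, outputs, name="decision_level_fusion")


if __name__ == "__main__":
    speech_model = build_speech_cnn_lstm()
    speech_model.summary()

    # Dummy per-modality probabilities (speech, mocap, text) for 5 utterances.
    rng = np.random.default_rng(0)
    dummy_probs = [rng.dirichlet(np.ones(NUM_CLASSES), size=5) for _ in range(3)]
    fusion = build_fusion_head()
    print(fusion.predict(dummy_probs).argmax(axis=-1))
```

The Deep CCA-based feature-level fusion mentioned in the abstract would instead learn correlated projections of the unimodal feature vectors before classification; it is omitted from this sketch for brevity.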
Pages: 2455-2470
Number of pages: 16