Towards enhancing emotion recognition via multimodal framework

Cited by: 3
Authors
Devi, C. Akalya [1 ]
Renuka, D. Karthika [1 ]
Pooventhiran, G. [2 ]
Harish, D. [3 ]
Yadav, Shweta [4 ]
Thirunarayan, Krishnaprasad [4 ]
Affiliations
[1] PSG Coll Technol, Dept Informat Technol, Coimbatore, Tamil Nadu, India
[2] Qualcomm India Private Ltd, Chennai, Tamil Nadu, India
[3] Software AG, Bangalore, Karnataka, India
[4] Wright State Univ, Dept Comp Sci & Engn, Dayton, OH 45435 USA
Keywords
Emotion recognition; time-distributed models; CNN-LSTM; BERT; DCCA;
DOI
10.3233/JIFS-220280
Chinese Library Classification
TP18 [Theory of artificial intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Emotional AI is the next era of AI, poised to play a major role in fields such as entertainment, health care, and self-paced online education by drawing on cues from multiple sources. In this work, we propose a multimodal emotion recognition system that extracts information from speech, motion-capture, and text data. The main aim of this research is to improve the unimodal architectures so that they outperform the state of the art, and then to combine them into a robust multimodal fusion architecture. We developed 1D and 2D CNN-LSTM time-distributed models for speech, a hybrid CNN-LSTM model for motion-capture data, and a BERT-based model for text data to achieve state-of-the-art results, and we explored both concatenation-based decision-level fusion and Deep CCA-based feature-level fusion schemes. The proposed speech and mocap models achieve emotion recognition accuracies of 65.08% and 67.51%, respectively, and the BERT-based text model achieves an accuracy of 72.60%. The decision-level fusion approach significantly improves emotion detection accuracy on the IEMOCAP and MELD datasets: it reaches 80.20% on IEMOCAP, which is 8.61% higher than state-of-the-art methods, and 63.52% and 61.65% in 5-class and 7-class classification on MELD, both above the state of the art.
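The abstract does not spell out the fusion classifier itself, so the following is a minimal sketch, assuming a Keras implementation, of the concatenation-based decision-level fusion it describes: each unimodal model (speech, mocap, text) is assumed to emit a probability vector over the emotion classes, and these vectors are concatenated and fed to a small dense classifier. The four-class setup, the Dense(32) hidden layer, and all layer/variable names are hypothetical illustrations, not taken from the paper.

# Illustrative sketch (not the authors' code): concatenation-based
# decision-level fusion of per-modality emotion predictions, assuming
# each unimodal model already outputs a probability vector over
# NUM_CLASSES emotions. Layer sizes and class count are hypothetical.
import numpy as np
from tensorflow.keras import layers, Model

NUM_CLASSES = 4  # e.g. a 4-class IEMOCAP setup (assumption)

# Per-modality probability vectors produced by the unimodal models
speech_probs = layers.Input(shape=(NUM_CLASSES,), name="speech_probs")
mocap_probs = layers.Input(shape=(NUM_CLASSES,), name="mocap_probs")
text_probs = layers.Input(shape=(NUM_CLASSES,), name="text_probs")

# Decision-level fusion: concatenate the per-modality decisions and
# learn a small classifier on top of the joint decision vector.
fused = layers.Concatenate()([speech_probs, mocap_probs, text_probs])
hidden = layers.Dense(32, activation="relu")(fused)
output = layers.Dense(NUM_CLASSES, activation="softmax", name="emotion")(hidden)

fusion_model = Model(inputs=[speech_probs, mocap_probs, text_probs],
                     outputs=output)
fusion_model.compile(optimizer="adam",
                     loss="categorical_crossentropy",
                     metrics=["accuracy"])

# Dummy forward pass with random probability vectors, just to show shapes.
batch = [np.random.dirichlet(np.ones(NUM_CLASSES), size=8) for _ in range(3)]
print(fusion_model.predict(batch).shape)  # (8, NUM_CLASSES)

In a real pipeline the random inputs above would be replaced by the softmax outputs of the trained speech, mocap, and text models; the DCCA-based feature-level alternative would instead project intermediate features of two modalities into a maximally correlated space before classification.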
Pages: 2455-2470
Number of pages: 16