Attention-based multimodal sentiment analysis and emotion recognition using deep neural networks

Cited by: 25
Authors
Aslam, Ajwa [1]; Sargano, Allah Bux [1]; Habib, Zulfiqar [1]
Affiliations
[1] COMSATS University Islamabad, Department of Computer Science, Lahore 54000, Punjab, Pakistan
Keywords
Sentiment analysis; Emotion recognition; Multimodal attention; Deep neural networks; Fusion
DOI
10.1016/j.asoc.2023.110494
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
There has been growing interest in multimodal sentiment analysis and emotion recognition in recent years, owing to their wide range of practical applications. Multiple modalities allow for the integration of complementary information, improving the accuracy and precision of sentiment and emotion recognition tasks. However, working with multiple modalities presents several challenges, including handling data-source heterogeneity, fusing information, aligning and synchronizing modalities, and designing effective feature extraction techniques that capture discriminative information from each modality. This paper introduces a novel framework called "Attention-based Multimodal Sentiment Analysis and Emotion Recognition (AMSAER)" to address these challenges. The framework leverages intra-modality discriminative features and inter-modality correlations in the visual, audio, and textual modalities. It incorporates an attention mechanism to facilitate sentiment and emotion classification from visual, textual, and acoustic inputs by emphasizing task-relevant aspects. The proposed approach employs separate models for each modality to automatically extract discriminative semantic words, image regions, and audio features. A deep hierarchical model is then developed, incorporating intermediate fusion to learn hierarchical correlations between the modalities at the bimodal and trimodal levels. Finally, the framework combines four distinct models through decision-level fusion to enable multimodal sentiment analysis and emotion recognition. The effectiveness of the proposed framework is demonstrated through extensive experiments on the publicly available Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. The results confirm a notable performance improvement over state-of-the-art methods, attaining 85% and 93% accuracy for sentiment analysis and emotion classification, respectively. Additionally, in terms of class-wise accuracy, the "angry" emotion and "positive" sentiment are classified more effectively than the other emotions and sentiments, achieving 96.80% and 93.14% accuracy, respectively. © 2023 Elsevier B.V. All rights reserved.
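The abstract describes the AMSAER pipeline only in prose. For orientation, the sketch below shows the general pattern it names in minimal PyTorch: per-modality soft attention over extracted features, intermediate (feature-level) fusion of the pooled modality vectors, and decision-level fusion across several models. This is a hedged reconstruction, not the authors' implementation: the SoftAttention formulation, all layer sizes, the four-class output, and the two-model ensemble are assumptions made for the example, and the bimodal fusion branches described in the abstract are omitted for brevity.

```python
# Illustrative sketch only; hyperparameters and module names are assumptions,
# not the AMSAER authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Scores each step of a feature sequence and returns the weighted sum."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                          # x: (batch, steps, dim)
        weights = F.softmax(self.score(x), dim=1)  # attention over steps
        return (weights * x).sum(dim=1)            # (batch, dim)

class TrimodalClassifier(nn.Module):
    """Per-modality attention pooling followed by feature-level fusion."""
    def __init__(self, dims=(512, 128, 300), hidden=256, num_classes=4):
        super().__init__()
        self.attn = nn.ModuleList([SoftAttention(d) for d in dims])
        self.fuse = nn.Sequential(
            nn.Linear(sum(dims), hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, visual, audio, text):
        pooled = [att(x) for att, x in zip(self.attn, (visual, audio, text))]
        return self.fuse(torch.cat(pooled, dim=-1))  # class logits

def decision_fusion(models, visual, audio, text):
    """Decision-level fusion: average the softmax outputs of several models."""
    probs = [F.softmax(m(visual, audio, text), dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)

if __name__ == "__main__":
    v = torch.randn(2, 10, 512)  # e.g. 10 video-frame feature vectors
    a = torch.randn(2, 20, 128)  # e.g. 20 acoustic feature frames
    t = torch.randn(2, 15, 300)  # e.g. 15 word embeddings
    models = [TrimodalClassifier(), TrimodalClassifier()]
    print(decision_fusion(models, v, a, t).shape)  # torch.Size([2, 4])
```

In the paper's full design, the decision-level step combines four distinct models; the softmax averaging shown here is one common choice and stands in for whatever combination rule the authors actually use.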
Pages: 16
References
83 items in total
[1] Abdu, Sarah A.; Yousef, Ahmed H.; Salem, Ashraf. Multimodal Video Sentiment Analysis Using Deep Learning Approaches, a Survey. Information Fusion, 2021, 76: 204-226.
[2] Adnan, Rana Muhammad; Kisi, Ozgur; Mostafa, Reham R.; Ahmed, Ali Najah; El-Shafie, Ahmed. The Potential of a Novel Support Vector Machine Trained with Modified Mayfly Optimization Algorithm for Streamflow Prediction. Hydrological Sciences Journal, 2022, 67(2): 161-174.
[3] Aldeneh, Z. Proceedings of the 19th ACM International Conference on Multimodal Interaction (ICMI 2017), 2017: 68. DOI: 10.1145/3136755.3136760.
[4] Asghar, Muhammad Zubair; Kundi, Fazal Masud; Ahmad, Shakeel; Khan, Aurangzeb; Khan, Furqan. T-SAF: Twitter Sentiment Analysis Framework Using a Hybrid Classification Scheme. Expert Systems, 2018, 35(1).
[5] Avots, Egils; Sapinski, Tomasz; Bachmann, Maie; Kaminska, Dorota. Audiovisual Emotion Recognition in Wild. Machine Vision and Applications, 2019, 30(5): 975-985.
[6] Bahdanau, D. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473, 2016.
[7] Borth, D. Large-Scale Visual Sentiment Ontology and Detectors Using Adjective Noun Pairs. 2013: 223. DOI: 10.1145/2502081.2502282.
[8] Busso, Carlos; Bulut, Murtaza; Lee, Chi-Chun; Kazemzadeh, Abe; Mower, Emily; Kim, Samuel; Chang, Jeannette N.; Lee, Sungbok; Narayanan, Shrikanth S. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database. Language Resources and Evaluation, 2008, 42(4): 335-359.
[9] Cai, Jie; Meng, Zibo; Khan, Ahmed Shehab; Li, Zhiyuan; O'Reilly, James; Han, Shizhong; Liu, Ping; Chen, Min; Tong, Yan. Feature-Level and Model-Level Audiovisual Fusion for Emotion Recognition in the Wild. 2019 2nd IEEE Conference on Multimedia Information Processing and Retrieval (MIPR 2019), 2019: 443-448.
[10] Cambria, Erik; Hazarika, Devamanyu; Poria, Soujanya; Hussain, Amir; Subramanyam, R. B. V. Benchmarking Multimodal Sentiment Analysis. Computational Linguistics and Intelligent Text Processing (CICLing 2017), Part II, 2018, 10762: 166-179.