A context-sensitive multi-tier deep learning framework for multimodal sentiment analysis

Cited by: 0
Authors
Ganesh Kumar P
Arul Antran Vijay S
Jothi Prakash V
Anand Paul
Anand Nayyar
Affiliations
[1] Department of Computer Science and Engineering, College of Engineering Guindy, Anna University
[2] Karpagam College of Engineering
[3] The School of Computer Science and Engineering, Kyungpook National University
[4] School of Computer Science, Faculty of Information Technology, Duy Tan University
Source
Multimedia Tools and Applications | 2024, Vol. 83
Keywords
Deep learning; Gated recurrent unit; Information retrieval; Multimedia analysis; Multimodal sentiment analysis; Sentiment analysis
DOI
Not available
Abstract
Sentiment Analysis (SA) is one of the most appealing multidisciplinary research areas in Artificial Intelligence (AI). Owing to the intricate and complementary interactions between modalities, Multimodal Sentiment Analysis (MSA) is a highly challenging task with a wide range of applications. Numerous deep learning models and techniques have been proposed for multimodal sentiment analysis, but they do not investigate the explicit context of words and cannot model the diverse components of a sentence; hence, the full potential of such diverse data has not been explored. In this research, a Context-Sensitive Multi-Tier Deep Learning Framework (CS-MDF) is proposed for sentiment analysis on multimodal data. The CS-MDF uses a three-tier architecture for extracting context-sensitive information. The first tier focuses on extracting unimodal features from the utterances: a Convolutional Neural Network (CNN) extracts text-based features, a 3D-CNN model extracts visual features, and the open-Source Media Interpretation by Large feature-space Extraction (openSMILE) toolkit extracts audio features.
This level of extraction ignores context-sensitive information while determining the features; CNNs suit text data because they are particularly effective at identifying local patterns and dependencies. The second tier operates on the features produced by the first tier: context-sensitive unimodal characteristics are extracted using a Bi-directional Gated Recurrent Unit (BiGRU), which comprehends inter-utterance links and uncovers contextual evidence. The output of the second tier is combined and passed to the third tier, which fuses the features from the different modalities and trains a single BiGRU model that provides the final classification. This approach applies the BiGRU model to sequential data processing, exploiting the advantages of the modalities and capturing their interdependencies. Experimental results obtained on six real-life datasets (Flickr Images dataset, Multi-View Sentiment Analysis dataset, Getty Images dataset, Balanced Twitter for Sentiment Analysis dataset, CMU-MOSI dataset) show that the proposed CS-MDF model achieves better performance than ten state-of-the-art approaches, as validated by F1 score, precision, accuracy, and recall. An ablation study carried out on the proposed framework demonstrates the viability of the design, and the Grad-CAM visualization technique is applied to visualize the aligned input image-text pairs learned by the proposed CS-MDF model.
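The core mechanism the abstract describes in its second and third tiers is a bidirectional GRU run over a sequence of per-utterance feature vectors (per modality, then over the fused concatenation). The following is a minimal numpy sketch of that step only, not the authors' implementation; all dimensions, random seeds, and the fused-feature layout (text, visual, and audio vectors simply concatenated per utterance) are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell. W* project the input, U* project the hidden state."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = lambda *shape: rng.normal(0.0, 0.1, size=shape)
        self.Wz, self.Uz, self.bz = s(hidden_dim, input_dim), s(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.Wr, self.Ur, self.br = s(hidden_dim, input_dim), s(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.Wh, self.Uh, self.bh = s(hidden_dim, input_dim), s(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)          # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)          # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h) + self.bh)
        return (1.0 - z) * h + z * h_tilde

def bigru(cell_fwd, cell_bwd, xs):
    """Run the sequence forward and backward; concatenate per-step states."""
    h, fwd = np.zeros(cell_fwd.hidden_dim), []
    for x in xs:
        h = cell_fwd.step(x, h)
        fwd.append(h)
    h, bwd = np.zeros(cell_bwd.hidden_dim), []
    for x in reversed(xs):
        h = cell_bwd.step(x, h)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# Hypothetical per-modality feature sizes for a 5-utterance video (illustrative).
T, d_text, d_vis, d_aud = 5, 8, 6, 4
rng = np.random.default_rng(1)
fused = [np.concatenate([rng.normal(size=d_text),   # tier-1 text features
                         rng.normal(size=d_vis),    # tier-1 visual features
                         rng.normal(size=d_aud)])   # tier-1 audio features
         for _ in range(T)]
cf = GRUCell(d_text + d_vis + d_aud, 16, seed=2)
cb = GRUCell(d_text + d_vis + d_aud, 16, seed=3)
states = bigru(cf, cb, fused)  # one 32-dim context-aware vector per utterance
```

Each output vector carries evidence from both preceding and following utterances, which is what lets the third tier classify an utterance in context rather than in isolation; a linear softmax head on these vectors would complete the sketch.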
Pages: 54249-54278
Page count: 29