Enhanced Emotion Recognition Through Multimodal Fusion Using TriModal Fusion Graph Convolutional Networks

Cited by: 1
Author
Li Maoheng [1]
Affiliation
[1] Guangdong Med Univ, Affiliated Hosp, Dept Informat Technol, Zhanjiang, Peoples R China
Source
2024 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN 2024 | 2024
Keywords
Emotion Recognition; Cross-modal Attention; Multimodal Fusion; Graph Convolutional Networks
DOI
10.1109/IJCNN60899.2024.10650481
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In the field of machine learning, self-supervised pre-training has made significant advances, particularly in speech, vision, and natural language processing (NLP). This study introduces the TriModal Fusion Graph Convolutional Network (TFGCN), a framework that leverages self-supervised pre-trained models to extract features across three modalities (text, audio, and visual) for emotion recognition. For text, we use GloVe, which provides rich distributional word embeddings. In the audio domain, we employ Wav2Vec to extract speech features, while CLIP captures visual features. Our method centers on cross-modal attention mechanisms that align and extract interactive information among the features of all three modalities. This cross-modal attention, combined with Graph Convolutional Networks (GCN), fuses the multimodal information and captures the complex patterns and interactions crucial for accurate emotion recognition. The proposed TFGCN architecture outperforms existing models, achieving a 2.92% absolute improvement in accuracy over the previous state of the art on the benchmark IEMOCAP dataset. We also run experiments within each individual modality (text, audio, and visual) and compare these unimodal results with the multimodal approach, demonstrating the significant gains achieved by our multimodal fusion strategy.
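The record describes TFGCN only at a high level, so the PyTorch sketch below is one plausible reading of the pipeline the abstract outlines: pairwise cross-modal attention between modality feature streams, followed by a graph convolution over modality nodes. Everything here is an assumption made for illustration, including the feature dimension (256), the four-class output, the specific attention pairing, the fully connected modality graph, and the names TriModalFusionSketch, CrossModalAttention, and GCNLayer; it is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttention(nn.Module):
    """One modality (query) attends over another modality (key/value)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_seq: torch.Tensor, context_seq: torch.Tensor) -> torch.Tensor:
        # query_seq: (B, T_q, D), context_seq: (B, T_c, D)
        out, _ = self.attn(query_seq, context_seq, context_seq)
        return out


class GCNLayer(nn.Module):
    """A single graph convolution: relu(D^{-1/2} A D^{-1/2} X W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D), adj: (B, N, N) with self-loops already included.
        deg_inv_sqrt = adj.sum(dim=-1).clamp(min=1e-6).rsqrt()  # (B, N)
        adj_norm = deg_inv_sqrt.unsqueeze(-1) * adj * deg_inv_sqrt.unsqueeze(-2)
        return F.relu(self.linear(adj_norm @ x))


class TriModalFusionSketch(nn.Module):
    """Hypothetical fusion head: pairwise cross-modal attention, then a GCN
    over three modality nodes joined by a fully connected graph."""

    def __init__(self, dim: int = 256, num_classes: int = 4):
        super().__init__()
        self.text_attends_audio = CrossModalAttention(dim)
        self.audio_attends_visual = CrossModalAttention(dim)
        self.visual_attends_text = CrossModalAttention(dim)
        self.gcn = GCNLayer(dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text, audio, visual):
        # Inputs: (B, T_m, D) sequences, e.g. projected GloVe / Wav2Vec / CLIP features.
        t = self.text_attends_audio(text, audio).mean(dim=1)      # (B, D)
        a = self.audio_attends_visual(audio, visual).mean(dim=1)
        v = self.visual_attends_text(visual, text).mean(dim=1)
        nodes = torch.stack([t, a, v], dim=1)                     # (B, 3, D)
        adj = torch.ones(nodes.size(0), 3, 3, device=nodes.device)  # fully connected, self-loops included
        fused = self.gcn(nodes, adj).mean(dim=1)                  # pool modality nodes
        return self.classifier(fused)                             # (B, num_classes)


# Toy usage with random tensors standing in for real GloVe/Wav2Vec/CLIP features.
model = TriModalFusionSketch(dim=256, num_classes=4)
logits = model(torch.randn(2, 20, 256), torch.randn(2, 50, 256), torch.randn(2, 30, 256))
print(logits.shape)  # torch.Size([2, 4])
```

A real implementation would build the graph according to the paper's actual design and feed in genuine pre-trained features rather than random tensors.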
Pages: 9