Automatic movie genre classification & emotion recognition via a BiProjection Multimodal Transformer

Times Cited: 1
Authors
Moreno-Galvan, Diego Aaron [1]
Lopez-Santillan, Roberto [2]
Gonzalez-Gurrola, Luis Carlos [2]
Montes-Y-Gomez, Manuel [3]
Sanchez-Vega, Fernando [1,5]
Lopez-Monroy, Adrian Pastor [1,4]
Affiliations
[1] Ctr Invest Matemat CIMAT, Dept Ciencias Comp, Jalisco S-N, Col Valenciana, Guanajuato 36023, Guanajuato, Mexico
[2] Univ Autonoma Chihuahua UACH, Fac Ingn, Circuito Univ Campus II, Chihuahua 31125, Chihuahua, Mexico
[3] Inst Nacl Astrofis Opt & Elect INAOE, Coordinac Ciencias Computac, Luis Enr Erro 1, Sta Ma Tonantzintla, Cholula 72840, Puebla, Mexico
[4] Univ Virtual Estado Guanajuato UVEG, Hermenegildo Bustos Numero 129 A Sur, Colonia Ctr, Purisima del Rincon 36400, Guanajuato, Mexico
[5] Consejo Nacl Human Ciencias & Tecnol CONAHCYT, Av Insurgentes 1582, Col Credito Constructor, Ciudad De Mexico 03940, Mexico
Keywords
Fusion neural architectures; Multimodal classification; Transformers; Movie genre classification; Emotion recognition; Multimodal Transformer; BERT; GMU; Multimodal fusion;
DOI
10.1016/j.inffus.2024.102641
Chinese Library Classification (CLC) number
TP18 [Theory of artificial intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Analyzing, manipulating, and comprehending data from multiple sources (e.g., websites, software applications, files, or databases) and of diverse modalities (e.g., video, images, audio, and text) has become increasingly important in many domains. Despite recent advances in multimodal classification (MC), several challenges remain to be addressed, such as the combination of modalities of very diverse nature, the optimal feature engineering for each modality, and the semantic alignment between text and images. Accordingly, the main motivation of our research lies in devising a neural architecture that effectively processes and combines the text, image, video, and audio modalities, so that it can offer noteworthy performance in different MC tasks. In this regard, the Multimodal Transformer (MulT) model is a cutting-edge approach often employed in multimodal supervised tasks, which, although effective, has a fixed architecture that limits its performance on specific tasks as well as its contextual understanding, meaning it may struggle to capture fine-grained temporal patterns in audio or to effectively model spatial relationships in images. To address these issues, our research modifies and extends the MulT model in several respects. Firstly, we leverage the Gated Multimodal Unit (GMU) module within the architecture to efficiently and dynamically weigh modalities at the instance level and to visualize the use of modalities. Secondly, to overcome the problem of vanishing and exploding gradients, we strategically place residual connections in the architecture. The proposed architecture is evaluated on two different and complex classification tasks: movie genre categorization (MGC) and multimodal emotion recognition (MER). The results obtained are encouraging, as they indicate that the proposed architecture is competitive against state-of-the-art (SOTA) models in MGC, outperforming them by up to 2% on the Moviescope dataset and by 1% on the MM-IMDB dataset. Furthermore, in the MER task the unaligned version of the datasets was employed, which is considerably more difficult; we improve SOTA accuracy results by up to 1% on the IEMOCAP dataset and attain a competitive outcome on the CMU-MOSEI collection, outperforming SOTA results on several emotions.
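To make the instance-level fusion step described above concrete, the following is a minimal sketch of a two-modality Gated Multimodal Unit in the spirit of Arevalo et al. (reference [1]). The PyTorch framing, layer sizes, and variable names are illustrative assumptions and do not reproduce the authors' implementation or the full BiProjection architecture.

import torch
import torch.nn as nn

class GatedMultimodalUnit(nn.Module):
    """Two-modality GMU sketch: project each modality, then mix with a learned per-instance gate."""
    def __init__(self, dim_a, dim_b, dim_out):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)        # projection for modality A (e.g., text)
        self.proj_b = nn.Linear(dim_b, dim_out)        # projection for modality B (e.g., video)
        self.gate = nn.Linear(dim_a + dim_b, dim_out)  # computes the gate z from both raw inputs

    def forward(self, x_a, x_b):
        h_a = torch.tanh(self.proj_a(x_a))
        h_b = torch.tanh(self.proj_b(x_b))
        z = torch.sigmoid(self.gate(torch.cat([x_a, x_b], dim=-1)))
        fused = z * h_a + (1.0 - z) * h_b              # instance-level weighting of the two modalities
        return fused, z                                # z can be inspected to visualize modality usage

# Usage: fuse a hypothetical 768-d text vector with a 512-d video vector for a batch of 4 instances.
gmu = GatedMultimodalUnit(dim_a=768, dim_b=512, dim_out=256)
fused, gate = gmu(torch.randn(4, 768), torch.randn(4, 512))
print(fused.shape, gate.shape)  # torch.Size([4, 256]) torch.Size([4, 256])

Because the gate z lies in (0, 1) for each feature, averaging it over a dataset gives a rough, per-instance picture of how much each modality contributes, which is the kind of modality-usage visualization the abstract refers to.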
Pages: 15
References (39 in total)
  • [1] Arevalo J., 2017, ICLR Workshop Track.
  • [2] Baltrusaitis T., 2017, arXiv:1705.09406.
  • [3] Bose D., Hebbar R., Somandepalli K., Zhang H., Cui Y., Cole-McLaughlin K., Wang H., Narayanan S. MovieCLIP: Visual Scene Recognition in Movies. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023: 2082-2091.
  • [4] Braz L., 2021, 11th Int. Conf. Pattern Recognition Systems, 2021: 200, DOI: 10.1049/icp.2021.1456.
  • [5] Brousmiche M., Rouat J., Dupont S. Multimodal Attentive Fusion Network for audio-visual event recognition. Information Fusion, 2022, 85: 52-59.
  • [6] Busso C., Bulut M., Lee C.-C., Kazemzadeh A., Mower E., Kim S., Chang J. N., Lee S., Narayanan S. S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335-359.
  • [7] Cascante-Bonilla P., 2019, arXiv:1908.03180.
  • [8] Charland P., Leger P.-M., Senecal S., Courtemanche F., Mercier J., Skelling Y., Labonte-Lemoyne E. Assessing the Multiple Dimensions of Engagement to Characterize Learning: A Neurophysiological Perspective. JoVE - Journal of Visualized Experiments, 2015, (101): 1-8.
  • [9] Dai W. L., 2021, 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021): 5305.
  • [10] Dai W. L., 2020, 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (AACL-IJCNLP 2020): 269.