Automatic movie genre classification & emotion recognition via a BiProjection Multimodal Transformer

Times Cited: 1
Authors
Moreno-Galvan, Diego Aaron [1]
Lopez-Santillan, Roberto [2]
Gonzalez-Gurrola, Luis Carlos [2]
Montes-Y-Gomez, Manuel [3]
Sanchez-Vega, Fernando [1,5]
Lopez-Monroy, Adrian Pastor [1,4]
Affiliations
[1] Ctr Invest Matemat CIMAT, Dept Ciencias Comp, Jalisco S-N, Col Valenciana, Guanajuato 36023, Guanajuato, Mexico
[2] Univ Autonoma Chihuahua UACH, Fac Ingn, Circuito Univ Campus II, Chihuahua 31125, Chihuahua, Mexico
[3] Inst Nacl Astrofis Opt & Elect INAOE, Coordinac Ciencias Computac, Luis Enr Erro 1, Sta Ma Tonantzintla, Cholula 72840, Puebla, Mexico
[4] Univ Virtual Estado Guanajuato UVEG, Hermenegildo Bustos Numero 129 A Sur, Colonia Ctr, Purisima del Rincon 36400, Guanajuato, Mexico
[5] Consejo Nacl Human Ciencias & Tecnol CONAHCYT, Av Insurgentes 1582, Col Credito Constructor, Ciudad De Mexico 03940, Mexico
Keywords
Fusion neural architectures; Multimodal classification; Transformers; Movie genre classification; Emotion recognition; Multimodal Transformer; BERT; GMU; Multimodal fusion;
DOI
10.1016/j.inffus.2024.102641
Chinese Library Classification (CLC) number
TP18 [Theory of artificial intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Analyzing, manipulating, and comprehending data from multiple sources (e.g., websites, software applications, files, or databases) and of diverse modalities (e.g., video, images, audio, and text) has become increasingly important in many domains. Despite recent advances in multimodal classification (MC), several challenges remain to be addressed, such as the combination of modalities of very diverse nature, the optimal feature engineering for each modality, and the semantic alignment between text and images. Accordingly, the main motivation of our research lies in devising a neural architecture that effectively processes and combines the text, image, video, and audio modalities, so that it can offer noteworthy performance in different MC tasks. In this regard, the Multimodal Transformer (MulT) model is a cutting-edge approach often employed in multimodal supervised tasks, which, although effective, has a fixed architecture that limits its performance on specific tasks as well as its contextual understanding, meaning it may struggle to capture fine-grained temporal patterns in audio or to effectively model spatial relationships in images. To address these issues, our research modifies and extends the MulT model in several respects. Firstly, we leverage the Gated Multimodal Unit (GMU) module within the architecture to efficiently and dynamically weigh modalities at the instance level and to visualize the use of modalities. Secondly, to overcome the problem of vanishing and exploding gradients, we strategically place residual connections in the architecture. The proposed architecture is evaluated on two different and complex classification tasks: movie genre categorization (MGC) and multimodal emotion recognition (MER). The results obtained are encouraging, as they indicate that the proposed architecture is competitive against state-of-the-art (SOTA) models in MGC, outperforming them by up to 2% on the Moviescope dataset and by 1% on the MM-IMDB dataset. Furthermore, in the MER task the unaligned version of the datasets was employed, which is considerably more difficult; we improve SOTA accuracy results by up to 1% on the IEMOCAP dataset and attain a competitive outcome on the CMU-MOSEI collection, outperforming SOTA results on several emotions.
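To make the instance-level fusion step described above concrete, the following is a minimal sketch of a two-modality Gated Multimodal Unit in the spirit of Arevalo et al. (reference [1]). The PyTorch framing, layer sizes, and variable names are illustrative assumptions and do not reproduce the authors' implementation or the full BiProjection architecture.

import torch
import torch.nn as nn

class GatedMultimodalUnit(nn.Module):
    """Two-modality GMU sketch: project each modality, then mix with a learned per-instance gate."""
    def __init__(self, dim_a, dim_b, dim_out):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)        # projection for modality A (e.g., text)
        self.proj_b = nn.Linear(dim_b, dim_out)        # projection for modality B (e.g., video)
        self.gate = nn.Linear(dim_a + dim_b, dim_out)  # computes the gate z from both raw inputs

    def forward(self, x_a, x_b):
        h_a = torch.tanh(self.proj_a(x_a))
        h_b = torch.tanh(self.proj_b(x_b))
        z = torch.sigmoid(self.gate(torch.cat([x_a, x_b], dim=-1)))
        fused = z * h_a + (1.0 - z) * h_b              # instance-level weighting of the two modalities
        return fused, z                                # z can be inspected to visualize modality usage

# Usage: fuse a hypothetical 768-d text vector with a 512-d video vector for a batch of 4 instances.
gmu = GatedMultimodalUnit(dim_a=768, dim_b=512, dim_out=256)
fused, gate = gmu(torch.randn(4, 768), torch.randn(4, 512))
print(fused.shape, gate.shape)  # torch.Size([4, 256]) torch.Size([4, 256])

Because the gate z lies in (0, 1) for each feature, averaging it over a dataset gives a rough, per-instance picture of how much each modality contributes, which is the kind of modality-usage visualization the abstract refers to.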
Pages: 15
References (39 in total)
  • [1] Arevalo J., 2017, ICLR Workshop Track.
  • [2] Baltrusaitis T., 2017, arXiv:1705.09406.
  • [3] Bose D., Hebbar R., Somandepalli K., Zhang H., Cui Y., Cole-McLaughlin K., Wang H., Narayanan S. MovieCLIP: Visual Scene Recognition in Movies. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023: 2082-2091.
  • [4] Braz L., 2021, 11th Int. Conf. Pattern Recognition Systems, 2021: 200, DOI: 10.1049/icp.2021.1456.
  • [5] Brousmiche M., Rouat J., Dupont S. Multimodal Attentive Fusion Network for audio-visual event recognition. Information Fusion, 2022, 85: 52-59.
  • [6] Busso C., Bulut M., Lee C.-C., Kazemzadeh A., Mower E., Kim S., Chang J. N., Lee S., Narayanan S. S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335-359.
  • [7] Cascante-Bonilla P., 2019, arXiv:1908.03180.
  • [8] Charland P., Leger P.-M., Senecal S., Courtemanche F., Mercier J., Skelling Y., Labonte-Lemoyne E. Assessing the Multiple Dimensions of Engagement to Characterize Learning: A Neurophysiological Perspective. JoVE - Journal of Visualized Experiments, 2015, (101): 1-8.
  • [9] Dai W. L., 2021, 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021): 5305.
  • [10] Dai W. L., 2020, 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (AACL-IJCNLP 2020): 269.