Hierarchical Multi-Modal Prompting Transformer for Multi-Modal Long Document Classification

Cited by: 3
Authors
Liu, Tengfei [1 ]
Hu, Yongli [1 ]
Gao, Junbin
Sun, Yanfeng
Yin, Baocai
Affiliations
[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Task analysis; Feature extraction; Visualization; Circuits and systems; Adaptation models; Computational modeling; Multi-modal long document classification; multi-modal transformer; adaptive multi-scale multi-modal transformer; prompt learning;
DOI
10.1109/TCSVT.2024.3366935
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Classification Codes
0808; 0809;
Abstract
In the context of long document classification (LDC), effectively utilizing multi-modal information encompassing texts and images within these documents has not received adequate attention. This task showcases several notable characteristics. Firstly, the text possesses an implicit or explicit hierarchical structure consisting of sections, sentences, and words. Secondly, the distribution of images is dispersed, encompassing various types such as highly relevant topic images and loosely related reference images. Lastly, intricate and diverse relationships exist between images and text at different levels. To address these challenges, we propose a novel approach called Hierarchical Multi-modal Prompting Transformer (HMPT). Our proposed method constructs the uni-modal and multi-modal transformers at both the section and sentence levels, facilitating effective interaction between features. Notably, we design an adaptive multi-scale multi-modal transformer tailored to capture the multi-granularity correlations between sentences and images. Additionally, we introduce three different types of shared prompts, i.e., shared section, sentence, and image prompts, as bridges connecting the isolated transformers, enabling seamless information interaction across different levels and modalities. To validate the model performance, we conducted experiments on two newly created and two publicly available multi-modal long document datasets. The obtained results show that our method outperforms state-of-the-art single-modality and multi-modality classification methods.
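The abstract's central mechanism is a set of shared prompts that bridge otherwise isolated transformers across levels and modalities. A minimal sketch of that idea in PyTorch follows; all class names, dimensions, and the two-level wiring are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the shared-prompt mechanism: learnable prompt
# vectors are prepended to each level's token sequence, so that separate
# transformer encoders can exchange information through the updated
# prompts. Names and sizes below are assumptions, not the HMPT code.

class PromptedEncoder(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_prompts=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.n_prompts = n_prompts

    def forward(self, tokens, shared_prompts):
        # tokens: (batch, seq, dim); shared_prompts: (batch, n_prompts, dim)
        x = torch.cat([shared_prompts, tokens], dim=1)
        x = self.encoder(x)
        # split the updated prompts back out so they can carry information
        # to the encoder at the next level (or the other modality)
        return x[:, self.n_prompts:], x[:, :self.n_prompts]

dim, n_prompts, batch = 256, 4, 2
# one shared prompt bank, reused by both levels
prompts = nn.Parameter(torch.randn(1, n_prompts, dim))

sent_enc = PromptedEncoder(dim, n_prompts=n_prompts)  # sentence level
sect_enc = PromptedEncoder(dim, n_prompts=n_prompts)  # section level

sent_tokens = torch.randn(batch, 10, dim)  # toy sentence features
sect_tokens = torch.randn(batch, 5, dim)   # toy section features

p = prompts.expand(batch, -1, -1)
sent_out, p = sent_enc(sent_tokens, p)  # prompts updated at sentence level
sect_out, p = sect_enc(sect_tokens, p)  # then passed to the section level
```

The key design point the abstract emphasizes is that the prompts are the only parameters shared between levels, so gradients flowing through them let the per-level transformers interact without merging their token streams.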
Pages: 6376-6390
Page count: 15
Related Papers
50 items total
  • [21] Knowledge Synergy Learning for Multi-Modal Tracking
    He, Yuhang
    Ma, Zhiheng
    Wei, Xing
    Gong, Yihong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 5519 - 5532
  • [22] Multi-Modal Multi-Grained Embedding Learning for Generalized Zero-Shot Video Classification
    Hong, Mingyao
    Zhang, Xinfeng
    Li, Guorong
    Huang, Qingming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5959 - 5972
  • [23] Metaknowledge Extraction Based on Multi-Modal Documents
    Liu, Shu-Kan
    Xu, Rui-Lin
    Geng, Bo-Ying
    Sun, Qiao
    Duan, Li
    Liu, Yi-Ming
    IEEE ACCESS, 2021, 9 : 50050 - 50060
  • [24] A Multi-Modal ELMo Model for Image Sentiment Recognition of Consumer Data
    Rong, Lu
    Ding, Yijie
    Wang, Mengyao
    El Saddik, Abdulmotaleb
    Hossain, M. Shamim
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2024, 70 (01) : 3697 - 3708
  • [25] Multi-Modal Multi-Channel Target Speech Separation
    Gu, Rongzhi
    Zhang, Shi-Xiong
    Xu, Yong
    Chen, Lianwu
    Zou, Yuexian
    Yu, Dong
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2020, 14 (03) : 530 - 541
  • [26] Depth for Multi-Modal Contour Ensembles
    Chaves-de-Plaza, N. F.
    Molenaar, M.
    Mody, P.
    Staring, M.
    van Egmond, R.
    Eisemann, E.
    Vilanova, A.
    Hildebrandt, K.
    COMPUTER GRAPHICS FORUM, 2024, 43 (03)
  • [27] Multi-Modal Hierarchical Empathetic Framework for Social Robots With Affective Body Control
    Gao, Yue
    Fu, Yangqing
    Sun, Ming
    Gao, Feng
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2024, 15 (03) : 1621 - 1633
  • [28] Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing
    Siebert, Tim
    Clasen, Kai Norman
    Ravanbakhsh, Mahdyar
    Demir, Beguem
    IMAGE AND SIGNAL PROCESSING FOR REMOTE SENSING XXVIII, 2022, 12267
  • [29] MTCAM: A Novel Weakly-Supervised Audio-Visual Saliency Prediction Model With Multi-Modal Transformer
    Zhu, Dandan
    Zhu, Kun
    Ding, Weiping
    Zhang, Nana
    Min, Xiongkuo
    Zhai, Guangtao
    Yang, Xiaokang
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 8 (02): : 1756 - 1771
  • [30] MATNet: Exploiting Multi-Modal Features for Radiology Report Generation
    Shang, Caozhi
    Cui, Shaoguo
    Li, Tiansong
    Wang, Xi
    Li, Yongmei
    Jiang, Jingfeng
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2692 - 2696