Hierarchical Multi-Modal Prompting Transformer for Multi-Modal Long Document Classification

Cited by: 3
Authors
Liu, Tengfei [1 ]
Hu, Yongli [1 ]
Gao, Junbin
Sun, Yanfeng
Yin, Baocai
Affiliations
[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Task analysis; Feature extraction; Visualization; Circuits and systems; Adaptation models; Computational modeling; Multi-modal long document classification; multi-modal transformer; adaptive multi-scale multi-modal transformer; prompt learning;
DOI
10.1109/TCSVT.2024.3366935
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
In the context of long document classification (LDC), effectively exploiting the multi-modal information, i.e., the texts and images within these documents, has not received adequate attention. The task exhibits several notable characteristics. First, the text has an implicit or explicit hierarchical structure of sections, sentences, and words. Second, the images are dispersed throughout the document and vary in type, ranging from highly relevant topic images to loosely related reference images. Third, intricate and diverse relationships exist between images and text at different levels. To address these challenges, we propose a novel approach called the Hierarchical Multi-modal Prompting Transformer (HMPT). Our method constructs uni-modal and multi-modal transformers at both the section and sentence levels, facilitating effective interaction between features. Notably, we design an adaptive multi-scale multi-modal transformer tailored to capture the multi-granularity correlations between sentences and images. Additionally, we introduce three types of shared prompts, i.e., shared section, sentence, and image prompts, as bridges connecting the isolated transformers, enabling seamless information interaction across different levels and modalities. To validate the model's performance, we conducted experiments on two newly created and two publicly available multi-modal long document datasets. The results show that our method outperforms state-of-the-art single-modality and multi-modality classification methods.
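The shared-prompt idea in the abstract can be illustrated with a minimal sketch: the same prompt tokens are prepended to the input sequences of otherwise isolated transformers, so training them in one place affects all the others. This is not the authors' implementation; all names, sizes, and the toy token representation below are hypothetical.

```python
import random

random.seed(0)
D = 8            # hidden dimension (hypothetical)
NUM_PROMPTS = 2  # prompt tokens per type (hypothetical)

def token(dim):
    """A toy token: a list of `dim` random floats standing in for an embedding."""
    return [random.random() for _ in range(dim)]

# One shared bank of prompt tokens per type: section, sentence, image.
# In the paper these would be learnable parameters reused across transformers.
shared_prompts = {
    kind: [token(D) for _ in range(NUM_PROMPTS)]
    for kind in ("section", "sentence", "image")
}

def with_prompts(tokens, *kinds):
    """Prepend the requested shared prompt tokens to a token sequence."""
    seq = []
    for kind in kinds:
        seq.extend(shared_prompts[kind])  # same token objects reused everywhere
    seq.extend(tokens)
    return seq

sentence_tokens = [token(D) for _ in range(5)]
section_tokens = [token(D) for _ in range(3)]

# Sentence-level multi-modal transformer input: sentence + image prompts first.
sentence_input = with_prompts(sentence_tokens, "sentence", "image")
# Section-level transformer input: section prompts first.
section_input = with_prompts(section_tokens, "section")

print(len(sentence_input))  # 2 + 2 + 5 = 9 tokens
print(len(section_input))   # 2 + 3 = 5 tokens
```

Because the prompt banks are the same objects in every sequence, updating them while training the sentence-level transformer would immediately change what the section-level transformer sees, which is how the shared prompts act as bridges between levels and modalities.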
Pages: 6376-6390
Page count: 15
Related Papers
50 records in total
  • [1] Multi-modal long document classification based on Hierarchical Prompt and Multi-modal Transformer
    Liu, Tengfei
    Hu, Yongli
    Gao, Junbin
    Wang, Jiapu
    Sun, Yanfeng
    Yin, Baocai
    NEURAL NETWORKS, 2024, 176
  • [2] HM-Transformer: Hierarchical Multi-modal Transformer for Long Document Image Understanding
    Deng, Xi
    Li, Shasha
    Yu, Jie
    Ma, Jun
    WEB AND BIG DATA, PT IV, APWEB-WAIM 2023, 2024, 14334 : 232 - 245
  • [3] Prompting for Multi-Modal Tracking
    Yang, Jinyu
    Li, Zhe
    Zheng, Feng
    Leonardis, Ales
    Song, Jingkuan
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 3492 - 3500
  • [4] Multi-Modal Knowledge Graph Transformer Framework for Multi-Modal Entity Alignment
    Li, Qian
    Ji, Cheng
    Guo, Shu
    Liang, Zhaoji
    Wang, Lihong
    Li, Jianxin
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 987 - 999
  • [5] A MULTI-MODAL TRANSFORMER APPROACH FOR FOOTBALL EVENT CLASSIFICATION
    Zhang, Yixiao
    Li, Baihua
    Fang, Hui
    Meng, Qinggang
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 2220 - 2224
  • [6] A Multi-Modal Multilingual Benchmark for Document Image Classification
    Fujinuma, Yoshinari
    Varia, Siddharth
    Sankaran, Nishant
    Min, Bonan
    Appalaraju, Srikar
    Vyas, Yogarshi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 14361 - 14376
  • [7] Multi-modal Extreme Classification
    Mittal, Anshul
    Dahiya, Kunal
    Malani, Shreya
    Ramaswamy, Janani
    Kuruvilla, Seba
    Ajmera, Jitendra
    Chang, Keng-Hao
    Agarwal, Sumeet
    Kar, Purushottam
    Varma, Manik
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12383 - 12392
  • [8] Landmark Classification With Hierarchical Multi-Modal Exemplar Feature
    Zhu, Lei
    Shen, Jialie
    Jin, Hai
    Xie, Liang
    Zheng, Ran
    IEEE TRANSACTIONS ON MULTIMEDIA, 2015, 17 (07) : 981 - 993
  • [9] Transformer enabled multi-modal medical diagnosis for tuberculosis classification
    Kumar, Sachin
    Sharma, Shivani
    Megra, Kassahun Tadesse
    JOURNAL OF BIG DATA, 2025, 12 (01)
  • [10] Multi-modal mask Transformer network for social event classification
    Chen, H.
    Qian, S.
    Li, Z.
    Fang, Q.
    Xu, C.
    Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics, 2024, 50 (02): : 579 - 587