Hierarchical Multi-Modal Prompting Transformer for Multi-Modal Long Document Classification

被引:3
|
作者
Liu, Tengfei [1 ]
Hu, Yongli [1 ]
Gao, Junbin
Sun, Yanfeng
Yin, Baocai
机构
[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
基金
中国国家自然科学基金;
关键词
Transformers; Task analysis; Feature extraction; Visualization; Circuits and systems; Adaptation models; Computational modeling; Multi-modal long document classification; multi-modal transformer; adaptive multi-scale multi-modal transformer; prompt learning;
D O I
10.1109/TCSVT.2024.3366935
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
In the context of long document classification (LDC), effectively utilizing multi-modal information encompassing texts and images within these documents has not received adequate attention. This task showcases several notable characteristics. Firstly, the text possesses an implicit or explicit hierarchical structure consisting of sections, sentences, and words. Secondly, the distribution of images is dispersed, encompassing various types such as highly relevant topic images and loosely related reference images. Lastly, intricate and diverse relationships exist between images and text at different levels. To address these challenges, we propose a novel approach called Hierarchical Multi-modal Prompting Transformer (HMPT). Our proposed method constructs the uni-modal and multi-modal transformers at both the section and sentence levels, facilitating effective interaction between features. Notably, we design an adaptive multi-scale multi-modal transformer tailored to capture the multi-granularity correlations between sentences and images. Additionally, we introduce three different types of shared prompts, i.e., shared section, sentence, and image prompts, as bridges connecting the isolated transformers, enabling seamless information interaction across different levels and modalities. To validate the model performance, we conducted experiments on two newly created and two publicly available multi-modal long document datasets. The obtained results show that our method outperforms state-of-the-art single-modality and multi-modality classification methods.
引用
收藏
页码:6376 / 6390
页数:15
相关论文
共 50 条
  • [31] Multi-modal Fusion
    Liu, Huaping
    Hussain, Amir
    Wang, Shuliang
    INFORMATION SCIENCES, 2018, 432 : 462 - 462
  • [32] Multi-modal perception
    Hollier, MP
    Rimell, AN
    Hands, DS
    Voelcker, RM
    BT TECHNOLOGY JOURNAL, 1999, 17 (01) : 35 - 46
  • [33] Multi-modal mapping
    Darran Yates
    Nature Reviews Neuroscience, 2016, 17 : 536 - 536
  • [34] Multi-Modal Attribute Prompting for Vision-Language Models
    Liu, Xin
    Wu, Jiamin
    Yang, Wenfei
    Zhou, Xu
    Zhang, Tianzhu
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 11579 - 11591
  • [35] Hadamard matrix-guided multi-modal hashing for multi-modal retrieval
    Yu, Jun
    Huang, Wei
    Li, Zuhe
    Shu, Zhenqiu
    Zhu, Liang
    DIGITAL SIGNAL PROCESSING, 2022, 130
  • [36] Conversational multi-modal browser: An integrated multi-modal browser and dialog manager
    Tiwari, A
    Hosn, RA
    Maes, SH
    2003 SYMPOSIUM ON APPLICATIONS AND THE INTERNET, PROCEEDINGS, 2003, : 348 - 351
  • [37] Improved Sentiment Classification by Multi-modal Fusion
    Gan, Lige
    Benlamri, Rachid
    Khoury, Richard
    2017 THIRD IEEE INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING SERVICE AND APPLICATIONS (IEEE BIGDATASERVICE 2017), 2017, : 11 - 16
  • [38] Multi-modal Hierarchical Transformer for Occupancy Flow Field Prediction in Autonomous Driving
    Liu, Haochen
    Huang, Zhiyu
    Lv, Chen
    2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA, 2023, : 1449 - 1455
  • [39] Multi-modal classification in digital news libraries
    Chen, MY
    Hauptmann, A
    JCDL 2004: PROCEEDINGS OF THE FOURTH ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES: GLOBAL REACH AND DIVERSE IMPACT, 2004, : 212 - 213
  • [40] Multi-modal Music Genre Classification Approach
    Zhen, Chao
    Xu, Jieping
    PROCEEDINGS OF 2010 3RD IEEE INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND INFORMATION TECHNOLOGY (ICCSIT 2010), VOL 8, 2010, : 398 - 402