Multi-modal long document classification based on Hierarchical Prompt and Multi-modal Transformer

Cited: 0
Authors
Liu, Tengfei [1 ]
Hu, Yongli [1 ]
Gao, Junbin [2 ]
Wang, Jiapu [1 ]
Sun, Yanfeng [1 ]
Yin, Baocai [1 ]
Affiliations
[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
[2] Univ Sydney, Business Sch, Discipline Business Analyt, Sydney, NSW 2006, Australia
Funding
National Natural Science Foundation of China
关键词
Multi-modal long document classification; Multi-modal transformer; Prompt learning; Multi-scale multi-modal transformer;
DOI
10.1016/j.neunet.2024.106322
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In the realm of long document classification (LDC), previous research has predominantly focused on modeling unimodal texts, overlooking the potential of multi-modal documents incorporating images. To address this gap, we introduce an innovative approach for multi-modal long document classification based on the Hierarchical Prompt and Multi-modal Transformer (HPMT). The proposed HPMT method facilitates multi-modal interactions at both the section and sentence levels, enabling a comprehensive capture of hierarchical structural features and complex multi-modal associations of long documents. Specifically, a Multi-scale Multi-modal Transformer (MsMMT) is tailored to capture the multi-granularity correlations between sentences and images. This is achieved through the incorporation of multi-scale convolutional kernels on sentence features, enhancing the model's ability to discern intricate patterns. Furthermore, to facilitate cross-level information interaction and promote learning of specific features at different levels, we introduce a Hierarchical Prompt (HierPrompt) block. This block incorporates section-level prompts and sentence-level prompts, both derived from a global prompt via distinct projection networks. Extensive experiments are conducted on four challenging multi-modal long document datasets. The results conclusively demonstrate the superiority of our proposed method, showcasing its performance advantages over existing techniques.
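The two mechanisms summarized above can be illustrated with a rough sketch: a global prompt mapped through two distinct projections to section- and sentence-level prompts, and multi-scale windows over sentence features to capture multi-granularity context. All dimensions, the kernel widths (1, 3, 5), the mean-pooling windows, and the single-layer projections are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8       # embedding dimension (assumed)
n_sent = 6  # number of sentences in one section (assumed)

# Global prompt shared across levels (learned in the paper; random here).
global_prompt = rng.normal(size=d)

# Distinct projection networks (reduced to single linear maps in this sketch)
# derive the section-level and sentence-level prompts from the global prompt.
W_sec = rng.normal(size=(d, d))
W_sen = rng.normal(size=(d, d))
section_prompt = W_sec @ global_prompt
sentence_prompt = W_sen @ global_prompt

# Multi-scale processing of sentence features: windows of widths 1, 3 and 5
# aggregate neighbouring sentences at different granularities.
sent_feats = rng.normal(size=(n_sent, d))

def window_pool(x, k):
    """Mean-pool a window of k rows around each position ('same' padding)."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[i:i + k].mean(axis=0) for i in range(len(x))])

multi_scale = [window_pool(sent_feats, k) for k in (1, 3, 5)]
fused = np.concatenate(multi_scale, axis=-1)  # (n_sent, 3 * d)
print(fused.shape)
```

In the actual model the windows would be learned convolutional kernels and the fused features would feed the multi-modal transformer together with the image features; this sketch only shows the shape bookkeeping of the two ideas.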
Pages: 13
Related Papers
50 records total
  • [1] Hierarchical Multi-Modal Prompting Transformer for Multi-Modal Long Document Classification
    Liu, Tengfei
    Hu, Yongli
    Gao, Junbin
    Sun, Yanfeng
    Yin, Baocai
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 6376 - 6390
  • [2] HM-Transformer: Hierarchical Multi-modal Transformer for Long Document Image Understanding
    Deng, Xi
    Li, Shasha
    Yu, Jie
    Ma, Jun
    WEB AND BIG DATA, PT IV, APWEB-WAIM 2023, 2024, 14334 : 232 - 245
  • [3] Multi-Modal Knowledge Graph Transformer Framework for Multi-Modal Entity Alignment
    Li, Qian
    Ji, Cheng
    Guo, Shu
    Liang, Zhaoji
    Wang, Lihong
    Li, Jianxin
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 987 - 999
  • [4] Visual Prompt Multi-Modal Tracking
    Zhu, Jiawen
    Lai, Simiao
    Chen, Xin
    Wang, Dong
    Lu, Huchuan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 9516 - 9526
  • [5] MaPLe: Multi-modal Prompt Learning
    Khattak, Muhammad Uzair
    Rasheed, Hanoona
    Maaz, Muhammad
    Khan, Salman
    Khan, Fahad Shahbaz
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 19113 - 19122
  • [6] A MULTI-MODAL TRANSFORMER APPROACH FOR FOOTBALL EVENT CLASSIFICATION
    Zhang, Yixiao
    Li, Baihua
    Fang, Hui
    Meng, Qinggang
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 2220 - 2224
  • [7] A Multi-Modal Multilingual Benchmark for Document Image Classification
    Fujinuma, Yoshinari
    Varia, Siddharth
    Sankaran, Nishant
    Min, Bonan
    Appalaraju, Srikar
    Vyas, Yogarshi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 14361 - 14376
  • [8] RetrievalMMT: Retrieval-Constrained Multi-Modal Prompt Learning for Multi-Modal Machine Translation
    Wang, Yan
    Zeng, Yawen
    Liang, Junjie
    Xing, Xiaofen
    Xu, Jin
    Xu, Xiangmin
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 860 - 868
  • [9] Multi-modal Extreme Classification
    Mittal, Anshul
    Dahiya, Kunal
    Malani, Shreya
    Ramaswamy, Janani
    Kuruvilla, Seba
    Ajmera, Jitendra
    Chang, Keng-Hao
    Agarwal, Sumeet
    Kar, Purushottam
    Varma, Manik
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12383 - 12392
  • [10] Landmark Classification With Hierarchical Multi-Modal Exemplar Feature
    Zhu, Lei
    Shen, Jialie
    Jin, Hai
    Xie, Liang
    Zheng, Ran
    IEEE TRANSACTIONS ON MULTIMEDIA, 2015, 17 (07) : 981 - 993