Multi-modal long document classification based on Hierarchical Prompt and Multi-modal Transformer

Cited by: 0
|
Authors
Liu, Tengfei [1 ]
Hu, Yongli [1 ]
Gao, Junbin [2 ]
Wang, Jiapu [1 ]
Sun, Yanfeng [1 ]
Yin, Baocai [1 ]
Affiliations
[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
[2] Univ Sydney, Business Sch, Discipline Business Analyt, Sydney, NSW 2006, Australia
Funding
National Natural Science Foundation of China;
Keywords
Multi-modal long document classification; Multi-modal transformer; Prompt learning; Multi-scale multi-modal transformer;
DOI
10.1016/j.neunet.2024.106322
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
In the realm of long document classification (LDC), previous research has predominantly focused on modeling unimodal texts, overlooking the potential of multi-modal documents incorporating images. To address this gap, we introduce an innovative approach for multi-modal long document classification based on the Hierarchical Prompt and Multi-modal Transformer (HPMT). The proposed HPMT method facilitates multi-modal interactions at both the section and sentence levels, enabling a comprehensive capture of hierarchical structural features and complex multi-modal associations of long documents. Specifically, a Multi-scale Multi-modal Transformer (MsMMT) is tailored to capture the multi-granularity correlations between sentences and images. This is achieved through the incorporation of multi-scale convolutional kernels on sentence features, enhancing the model's ability to discern intricate patterns. Furthermore, to facilitate cross-level information interaction and promote learning of specific features at different levels, we introduce a Hierarchical Prompt (HierPrompt) block. This block incorporates section-level prompts and sentence-level prompts, both derived from a global prompt via distinct projection networks. Extensive experiments are conducted on four challenging multi-modal long document datasets. The results conclusively demonstrate the superiority of our proposed method, showcasing its performance advantages over existing techniques.
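Note: the abstract describes two components, the HierPrompt block (section- and sentence-level prompts projected from one global prompt) and the MsMMT (multi-scale convolutional kernels over sentence features followed by multi-modal interaction). The following is a minimal PyTorch sketch of how such components might be wired, assuming sentence and image features share a common hidden dimension; all module names, layer choices (two-layer MLP projections, a single cross-attention layer), and hyperparameters are illustrative assumptions, not the paper's actual implementation.

    import torch
    import torch.nn as nn

    class HierPrompt(nn.Module):
        # Derives section-level and sentence-level prompts from a single
        # global prompt via two distinct projection networks, as the
        # abstract states. Two-layer MLPs are an assumption here.
        def __init__(self, num_prompts: int = 4, dim: int = 768):
            super().__init__()
            self.global_prompt = nn.Parameter(torch.randn(num_prompts, dim))
            self.to_section = nn.Sequential(
                nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            self.to_sentence = nn.Sequential(
                nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self):
            return (self.to_section(self.global_prompt),
                    self.to_sentence(self.global_prompt))

    class MsMMT(nn.Module):
        # Multi-scale Multi-modal Transformer sketch: 1-D convolutions with
        # several kernel sizes yield multi-granularity sentence features,
        # which then attend to image features via standard cross-attention.
        def __init__(self, dim: int = 768, scales=(1, 3, 5), heads: int = 8):
            super().__init__()
            self.convs = nn.ModuleList(
                nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2)
                for k in scales)
            self.cross_attn = nn.MultiheadAttention(dim, heads,
                                                    batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, sent_feats, img_feats, prompts=None):
            # sent_feats: (B, S, dim) sentence features;
            # img_feats: (B, R, dim) image-region features;
            # prompts: (P, dim) level-specific prompts from HierPrompt.
            x = sent_feats.transpose(1, 2)                    # (B, dim, S)
            multi_scale = [conv(x).transpose(1, 2) for conv in self.convs]
            q = torch.cat(multi_scale, dim=1)                 # concat granularities
            if prompts is not None:                           # prepend prompts
                q = torch.cat([prompts.expand(q.size(0), -1, -1), q], dim=1)
            out, _ = self.cross_attn(q, img_feats, img_feats) # text attends to image
            return self.norm(out + q)

    # Illustrative usage with random features (batch of 2 documents,
    # 40 sentences, 9 image regions, hidden size 768):
    prompt_net, block = HierPrompt(), MsMMT()
    section_prompts, sentence_prompts = prompt_net()
    fused = block(torch.randn(2, 40, 768), torch.randn(2, 9, 768),
                  prompts=sentence_prompts)

In this sketch the sentence-level prompts are prepended to the multi-scale sentence queries; the section-level prompts would play the analogous role at the section level. The abstract does not specify the fusion order or the classification head, so those are left out.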
Pages: 13
Related Papers
50 records in total
  • [21] Multi-modal Transformer for Brain Tumor Segmentation
    Cho, Jihoon
    Park, Jinah
    BRAINLESION: GLIOMA, MULTIPLE SCLEROSIS, STROKE AND TRAUMATIC BRAIN INJURIES, BRAINLES 2022, 2023, 13769 : 138 - 148
  • [22] A Multi-Modal Transformer network for action detection
    Korban, Matthew
    Youngs, Peter
    Acton, Scott T.
    PATTERN RECOGNITION, 2023, 142
  • [23] Multi-Modal Adversarial Example Detection with Transformer
    Ding, Chaoyue
    Sun, Shiliang
    Zhao, Jing
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022
  • [24] Multi-modal transformer for fake news detection
    Yang, Pingping
    Ma, Jiachen
    Liu, Yong
    Liu, Meng
    MATHEMATICAL BIOSCIENCES AND ENGINEERING, 2023, 20 (08) : 14699 - 14717
  • [25] Flexible Dual Multi-Modal Hashing for Incomplete Multi-Modal Retrieval
    Wei, Yuhong
    An, Junfeng
    INTERNATIONAL JOURNAL OF IMAGE AND GRAPHICS, 2024
  • [26] Multi-modal Information Integration for Document Retrieval
    Hassan, Ehtesham
    Chaudhury, Santanu
    Gopal, M.
    2013 12TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), 2013, : 1200 - 1204
  • [27] Hierarchical Multi-Task Learning for Diagram Question Answering with Multi-Modal Transformer
    Yuan, Zhaoquan
    Peng, Xiao
    Wu, Xiao
    Xu, Changsheng
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1313 - 1321
  • [28] Multi-Modal 2020: Multi-Modal Argumentation 30 Years Later
    Gilbert, Michael A.
    INFORMAL LOGIC, 2022, 42 (03): : 487 - 506
  • [29] Towards Flexible Multi-modal Document Models
    Inoue, Naoto
    Kikuchi, Kotaro
    Simo-Serra, Edgar
    Otani, Mayu
    Yamaguchi, Kota
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14287 - 14296
  • [30] Research on Emotion Classification Based on Multi-modal Fusion
    Xiang, Zhihua
    Radzi, Nor Haizan Mohamed
    Hashim, Haslina
    BAGHDAD SCIENCE JOURNAL, 2024, 21 (02) : 548 - 560