Multi-modal long document classification based on Hierarchical Prompt and Multi-modal Transformer

Cited by: 0
Authors
Liu, Tengfei [1 ]
Hu, Yongli [1 ]
Gao, Junbin [2 ]
Wang, Jiapu [1 ]
Sun, Yanfeng [1 ]
Yin, Baocai [1 ]
Affiliations
[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
[2] Univ Sydney, Business Sch, Discipline Business Analyt, Sydney, NSW 2006, Australia
Funding
National Natural Science Foundation of China;
Keywords
Multi-modal long document classification; Multi-modal transformer; Prompt learning; Multi-scale multi-modal transformer;
DOI
10.1016/j.neunet.2024.106322
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In the realm of long document classification (LDC), previous research has predominantly focused on modeling unimodal text, overlooking the potential of multi-modal documents that incorporate images. To address this gap, we introduce an approach for multi-modal long document classification based on the Hierarchical Prompt and Multi-modal Transformer (HPMT). The proposed HPMT method facilitates multi-modal interactions at both the section and sentence levels, enabling a comprehensive capture of the hierarchical structural features and complex multi-modal associations of long documents. Specifically, a Multi-scale Multi-modal Transformer (MsMMT) is tailored to capture the multi-granularity correlations between sentences and images; this is achieved by applying multi-scale convolutional kernels to sentence features, enhancing the model's ability to discern intricate patterns. Furthermore, to facilitate cross-level information interaction and promote the learning of level-specific features, we introduce a Hierarchical Prompt (HierPrompt) block, which incorporates section-level and sentence-level prompts, both derived from a global prompt via distinct projection networks. Extensive experiments on four challenging multi-modal long document datasets demonstrate that the proposed method consistently outperforms existing techniques.
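To make the two components concrete, below is a minimal PyTorch sketch: a hierarchical prompt block that derives section-level and sentence-level prompts from one shared global prompt through distinct projection networks, and a multi-scale multi-modal block that convolves sentence features with several kernel widths before cross-attending to image features. The module names, hidden size, prompt length, kernel sizes, and fusion wiring are illustrative assumptions made for this example, not the implementation reported in the paper.

```python
# Illustrative sketch only: dimensions, prompt length, kernel sizes,
# and the fusion wiring are assumptions, not the paper's actual design.
import torch
import torch.nn as nn


class HierPromptBlock(nn.Module):
    """Derives section-level and sentence-level prompts from a single
    shared global prompt via two distinct projection networks."""

    def __init__(self, dim: int, prompt_len: int = 4):
        super().__init__()
        # One learnable global prompt shared across both levels.
        self.global_prompt = nn.Parameter(torch.randn(prompt_len, dim))
        # Distinct projections specialize the prompt for each level.
        self.to_section = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.to_sentence = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, batch_size: int):
        g = self.global_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return self.to_section(g), self.to_sentence(g)


class MsMMTBlock(nn.Module):
    """Multi-scale multi-modal fusion: 1-D convolutions with several
    kernel widths build multi-granularity sentence representations,
    which then cross-attend to image region features."""

    def __init__(self, dim: int, kernel_sizes=(1, 3, 5), num_heads: int = 4):
        super().__init__()
        # Same-length outputs at every scale via symmetric padding.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, sent_feats, img_feats):
        # sent_feats: (B, S, dim); img_feats: (B, R, dim)
        x = sent_feats.transpose(1, 2)                      # (B, dim, S)
        scales = [conv(x).transpose(1, 2) for conv in self.convs]
        multi_scale = torch.cat(scales, dim=1)              # (B, S*len(ks), dim)
        # Text queries attend to image keys/values.
        fused, _ = self.cross_attn(multi_scale, img_feats, img_feats)
        return self.norm(multi_scale + fused)


# Usage: prepend the sentence-level prompts to the sentence stream,
# then fuse with image region features.
B, S, R, D = 2, 16, 9, 256
prompts, msmmt = HierPromptBlock(D), MsMMTBlock(D)
section_p, sentence_p = prompts(B)
sent_feats, img_feats = torch.randn(B, S, D), torch.randn(B, R, D)
out = msmmt(torch.cat([sentence_p, sent_feats], dim=1), img_feats)
print(out.shape)  # torch.Size([2, 60, 256]): (S + prompt_len) * 3 scales
```

In the full HPMT pipeline the section-level prompts would analogously condition a section-level encoder; the sketch exercises only the sentence level to keep the example short.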
Pages: 13
Related papers
50 records in total
  • [31] BloomVQA: Assessing Hierarchical Multi-modal Comprehension
    Gong, Yunye
    Shrestha, Robik
    Claypoole, Jared
    Cogswell, Michael
    Ray, Arijit
    Kanan, Christopher
    Divakaran, Ajay
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 14905 - 14918
  • [32] Multi-modal tree-based SVM classification
    Freeman, Cecille
    Kulic, Dana
    Basir, Otman
    2013 12TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2013), VOL 1, 2013, : 65 - 71
  • [33] Multi-Modal Pedestrian Detection with Large Misalignment Based on Modal-Wise Regression and Multi-Modal IoU
    Wanchaitanawong, Napat
    Tanaka, Masayuki
    Shibata, Takashi
    Okutomi, Masatoshi
    PROCEEDINGS OF 17TH INTERNATIONAL CONFERENCE ON MACHINE VISION APPLICATIONS (MVA 2021), 2021,
  • [34] Cross-Modal Retrieval Augmentation for Multi-Modal Classification
    Gur, Shir
    Neverova, Natalia
    Stauffer, Chris
    Lim, Ser-Nam
    Kiela, Douwe
    Reiter, Austin
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 111 - 123
  • [35] Multi-modal Perception
    Kondo, T.
    DENSHI JOHO TSUSHIN GAKKAI SHI / JOURNAL OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, 78 (12)
  • [36] Multi-modal mapping
    Yates, Darran
    NATURE REVIEWS NEUROSCIENCE, 2016, 17 (09) : 536 - 536
  • [38] Multi-modal Fusion
    Liu, Huaping
    Hussain, Amir
    Wang, Shuliang
    INFORMATION SCIENCES, 2018, 432 : 462 - 462
  • [39] Multi-modal perception
    Hollier, MP
    Rimell, AN
    Hands, DS
    Voelcker, RM
    BT TECHNOLOGY JOURNAL, 1999, 17 (01) : 35 - 46