Multi-modal long document classification based on Hierarchical Prompt and Multi-modal Transformer

Authors
Liu, Tengfei [1]
Hu, Yongli [1]
Gao, Junbin [2]
Wang, Jiapu [1]
Sun, Yanfeng [1]
Yin, Baocai [1]
Affiliations
[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
[2] Univ Sydney, Business Sch, Discipline Business Analyt, Sydney, NSW 2006, Australia
Funding
National Natural Science Foundation of China
Keywords
Multi-modal long document classification; Multi-modal transformer; Prompt learning; Multi-scale multi-modal transformer;
DOI
10.1016/j.neunet.2024.106322
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In the realm of long document classification (LDC), previous research has predominantly focused on modeling unimodal texts, overlooking the potential of multi-modal documents incorporating images. To address this gap, we introduce an innovative approach for multi-modal long document classification based on the Hierarchical Prompt and Multi-modal Transformer (HPMT). The proposed HPMT method facilitates multi-modal interactions at both the section and sentence levels, enabling a comprehensive capture of hierarchical structural features and complex multi-modal associations of long documents. Specifically, a Multi-scale Multi-modal Transformer (MsMMT) is tailored to capture the multi-granularity correlations between sentences and images. This is achieved through the incorporation of multi-scale convolutional kernels on sentence features, enhancing the model's ability to discern intricate patterns. Furthermore, to facilitate cross-level information interaction and promote learning of specific features at different levels, we introduce a Hierarchical Prompt (HierPrompt) block. This block incorporates section-level prompts and sentence-level prompts, both derived from a global prompt via distinct projection networks. Extensive experiments are conducted on four challenging multi-modal long document datasets. The results conclusively demonstrate the superiority of our proposed method, showcasing its performance advantages over existing techniques.
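The two mechanisms named in the abstract, multi-scale convolution over sentence features and prompts derived from a shared global prompt through distinct projections, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no kernel sizes, dimensions, activation, or parameter shapes, so the function names (`multi_scale_conv`, `hierarchical_prompts`), the kernel sizes (1, 2, 3), the `tanh` activation, and the random weights standing in for learned parameters are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def multi_scale_conv(sent_feats, kernel_sizes=(1, 2, 3)):
    """Convolve along the sentence axis once per kernel size.

    sent_feats: array of shape (n_sent, d).
    Returns a list with one (n_sent, d) array per kernel size.
    Kernel sizes and random weights are illustrative assumptions.
    """
    n_sent, d = sent_feats.shape
    outputs = []
    for k in kernel_sizes:
        # Random stand-in for a learned kernel of shape (k, d_in, d_out).
        weight = rng.standard_normal((k, d, d)) / np.sqrt(k * d)
        # Zero-pad so every scale keeps n_sent output positions.
        left = (k - 1) // 2
        right = k - 1 - left
        padded = np.pad(sent_feats, ((left, right), (0, 0)))
        out = np.zeros((n_sent, d))
        for i in range(n_sent):
            window = padded[i:i + k]  # k neighbouring sentences, shape (k, d)
            out[i] = np.einsum("kd,kde->e", window, weight)
        outputs.append(out)
    return outputs


def hierarchical_prompts(global_prompt, w_section, w_sentence):
    """Derive two level-specific prompts from one global prompt.

    Each level uses its own projection matrix; the tanh activation
    is an assumption, not something stated in the abstract.
    """
    section_prompt = np.tanh(global_prompt @ w_section)
    sentence_prompt = np.tanh(global_prompt @ w_sentence)
    return section_prompt, sentence_prompt


d = 8                                       # hypothetical feature size
sentences = rng.standard_normal((5, d))     # 5 sentence embeddings
scales = multi_scale_conv(sentences)

global_prompt = rng.standard_normal((4, d))  # 4 prompt tokens
w_sec = rng.standard_normal((d, d))
w_sen = rng.standard_normal((d, d))
sec_p, sen_p = hierarchical_prompts(global_prompt, w_sec, w_sen)
```

Each entry of `scales` summarizes the sentences at a different context width, which is one plausible reading of "multi-granularity" in the abstract. How these features are then matched against image features inside the transformer is not described in the abstract and is left out here.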
Pages: 13