Multi-modal long document classification based on Hierarchical Prompt and Multi-modal Transformer

Cited by: 0

Authors:

Liu, Tengfei [1]

Hu, Yongli [1]

Gao, Junbin [2]

Wang, Jiapu [1]

Sun, Yanfeng [1]

Yin, Baocai [1]

Affiliations:

[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China

[2] Univ Sydney, Business Sch, Discipline Business Analyt, Sydney, NSW 2006, Australia

Source:

NEURAL NETWORKS | 2024 / Volume 176

Funding:

National Natural Science Foundation of China;

Keywords:

Multi-modal long document classification; Multi-modal transformer; Prompt learning; Multi-scale multi-modal transformer;

DOI:

10.1016/j.neunet.2024.106322

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In the realm of long document classification (LDC), previous research has predominantly focused on modeling unimodal texts, overlooking the potential of multi-modal documents incorporating images. To address this gap, we introduce an innovative approach for multi-modal long document classification based on the Hierarchical Prompt and Multi-modal Transformer (HPMT). The proposed HPMT method facilitates multi-modal interactions at both the section and sentence levels, enabling a comprehensive capture of hierarchical structural features and complex multi-modal associations of long documents. Specifically, a Multi-scale Multi-modal Transformer (MsMMT) is tailored to capture the multi-granularity correlations between sentences and images. This is achieved through the incorporation of multi-scale convolutional kernels on sentence features, enhancing the model's ability to discern intricate patterns. Furthermore, to facilitate cross-level information interaction and promote learning of specific features at different levels, we introduce a Hierarchical Prompt (HierPrompt) block. This block incorporates section-level prompts and sentence-level prompts, both derived from a global prompt via distinct projection networks. Extensive experiments are conducted on four challenging multi-modal long document datasets. The results conclusively demonstrate the superiority of our proposed method, showcasing its performance advantages over existing techniques.
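
The two mechanisms named in the abstract, multi-scale convolution over sentence features and prompts projected from a shared global prompt, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, kernel sizes (1, 3, 5), fixed averaging kernels (standing in for learned ones), random projection matrices, and toy dimensions are all assumptions for illustration, and the transformer attention between sentences and images is omitted entirely.

```python
import random


def linear(vec, weight):
    """Apply a projection: weight has d_out rows, each of length d_in."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def random_matrix(d_out, d_in, rng):
    """Hypothetical stand-in for a learned projection network."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]


def conv1d_same(seq, kernel):
    """1-D convolution along the sentence axis with zero padding.

    seq is a list of sentence feature vectors. Each scalar tap in kernel is
    applied identically to every feature dimension (a simplification).
    The output has the same number of sentences as the input.
    """
    left = (len(kernel) - 1) // 2
    dim = len(seq[0])
    out = []
    for i in range(len(seq)):
        acc = [0.0] * dim
        for j, tap in enumerate(kernel):
            idx = i + j - left
            if 0 <= idx < len(seq):
                for d in range(dim):
                    acc[d] += tap * seq[idx][d]
        out.append(acc)
    return out


def multi_scale_features(seq, kernel_sizes=(1, 3, 5)):
    """Return one view of the sentence sequence per kernel size (assumed sizes)."""
    return {k: conv1d_same(seq, [1.0 / k] * k) for k in kernel_sizes}


def hierarchical_prompts(global_prompt, w_section, w_sentence):
    """Derive section- and sentence-level prompts from one global prompt
    through two distinct projections."""
    return linear(global_prompt, w_section), linear(global_prompt, w_sentence)


rng = random.Random(0)
dim = 4
sentences = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
scales = multi_scale_features(sentences)

global_prompt = [rng.uniform(-1, 1) for _ in range(dim)]
section_prompt, sentence_prompt = hierarchical_prompts(
    global_prompt, random_matrix(dim, dim, rng), random_matrix(dim, dim, rng)
)
```

In this sketch each kernel size yields a view of the same sentence sequence at a different granularity, and the two prompts differ only because they pass through different projections of the same global vector. How the paper combines these views with image features is not specified in the abstract and is left out here.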

Pages: 13

50 items in total

[1] Hierarchical Multi-Modal Prompting Transformer for Multi-Modal Long Document Classification
Liu, Tengfei
Hu, Yongli
Gao, Junbin
Sun, Yanfeng
Yin, Baocai
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 6376 - 6390
[2] HM-Transformer: Hierarchical Multi-modal Transformer for Long Document Image Understanding
Deng, Xi
Li, Shasha
Yu, Jie
Ma, Jun
WEB AND BIG DATA, PT IV, APWEB-WAIM 2023, 2024, 14334 : 232 - 245
[3] Multi-Modal Knowledge Graph Transformer Framework for Multi-Modal Entity Alignment
Li, Qian
Ji, Cheng
Guo, Shu
Liang, Zhaoji
Wang, Lihong
Li, Jianxin
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 987 - 999
[4] Visual Prompt Multi-Modal Tracking
Zhu, Jiawen
Lai, Simiao
Chen, Xin
Wang, Dong
Lu, Huchuan
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 9516 - 9526
[5] MaPLe: Multi-modal Prompt Learning
Khattak, Muhammad Uzair
Rasheed, Hanoona
Maaz, Muhammad
Khan, Salman
Khan, Fahad Shahbaz
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 19113 - 19122
[6] A MULTI-MODAL TRANSFORMER APPROACH FOR FOOTBALL EVENT CLASSIFICATION
Zhang, Yixiao
Li, Baihua
Fang, Hui
Meng, Qinggang
2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 2220 - 2224
[7] A Multi-Modal Multilingual Benchmark for Document Image Classification
Fujinuma, Yoshinari
Varia, Siddharth
Sankaran, Nishant
Min, Bonan
Appalaraju, Srikar
Vyas, Yogarshi
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 14361 - 14376
[8] RetrievalMMT: Retrieval-Constrained Multi-Modal Prompt Learning for Multi-Modal Machine Translation
Wang, Yan
Zeng, Yawen
Liang, Junjie
Xing, Xiaofen
Xu, Jin
Xu, Xiangmin
PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 860 - 868
[9] Multi-modal Extreme Classification
Mittal, Anshul
Dahiya, Kunal
Malani, Shreya
Ramaswamy, Janani
Kuruvilla, Seba
Ajmera, Jitendra
Chang, Keng-Hao
Agarwal, Sumeet
Kar, Purushottam
Varma, Manik
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12383 - 12392
[10] Landmark Classification With Hierarchical Multi-Modal Exemplar Feature
Zhu, Lei
Shen, Jialie
Jin, Hai
Xie, Liang
Zheng, Ran
IEEE TRANSACTIONS ON MULTIMEDIA, 2015, 17 (07) : 981 - 993
