Hierarchical Multi-Modal Prompting Transformer for Multi-Modal Long Document Classification

Cited by: 3
Authors
Liu, Tengfei [1 ]
Hu, Yongli [1 ]
Gao, Junbin
Sun, Yanfeng
Yin, Baocai
Affiliations
[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Task analysis; Feature extraction; Visualization; Circuits and systems; Adaptation models; Computational modeling; Multi-modal long document classification; multi-modal transformer; adaptive multi-scale multi-modal transformer; prompt learning;
DOI
10.1109/TCSVT.2024.3366935
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
In the context of long document classification (LDC), effectively exploiting the multi-modal information, i.e., the texts and images within these documents, has not received adequate attention. The task exhibits several notable characteristics. First, the text has an implicit or explicit hierarchical structure of sections, sentences, and words. Second, the images are dispersed throughout the document and vary in type, ranging from highly relevant topic images to loosely related reference images. Third, intricate and diverse relationships exist between images and text at different levels. To address these challenges, we propose a novel approach called the Hierarchical Multi-modal Prompting Transformer (HMPT). Our method constructs uni-modal and multi-modal transformers at both the section and sentence levels, facilitating effective interaction between features. Notably, we design an adaptive multi-scale multi-modal transformer tailored to capture the multi-granularity correlations between sentences and images. Additionally, we introduce three types of shared prompts, i.e., shared section, sentence, and image prompts, as bridges connecting the isolated transformers, enabling seamless information interaction across different levels and modalities. To validate the model's performance, we conducted experiments on two newly created and two publicly available multi-modal long document datasets. The results show that our method outperforms state-of-the-art single-modality and multi-modality classification methods.
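The shared-prompt idea in the abstract can be illustrated with a minimal sketch: the same prompt tokens are prepended to the input sequences of otherwise isolated transformers, so training them in one place affects all the others. This is not the authors' implementation; all names, sizes, and the toy token representation below are hypothetical.

```python
import random

random.seed(0)
D = 8            # hidden dimension (hypothetical)
NUM_PROMPTS = 2  # prompt tokens per type (hypothetical)

def token(dim):
    """A toy token: a list of `dim` random floats standing in for an embedding."""
    return [random.random() for _ in range(dim)]

# One shared bank of prompt tokens per type: section, sentence, image.
# In the paper these would be learnable parameters reused across transformers.
shared_prompts = {
    kind: [token(D) for _ in range(NUM_PROMPTS)]
    for kind in ("section", "sentence", "image")
}

def with_prompts(tokens, *kinds):
    """Prepend the requested shared prompt tokens to a token sequence."""
    seq = []
    for kind in kinds:
        seq.extend(shared_prompts[kind])  # same token objects reused everywhere
    seq.extend(tokens)
    return seq

sentence_tokens = [token(D) for _ in range(5)]
section_tokens = [token(D) for _ in range(3)]

# Sentence-level multi-modal transformer input: sentence + image prompts first.
sentence_input = with_prompts(sentence_tokens, "sentence", "image")
# Section-level transformer input: section prompts first.
section_input = with_prompts(section_tokens, "section")

print(len(sentence_input))  # 2 + 2 + 5 = 9 tokens
print(len(section_input))   # 2 + 3 = 5 tokens
```

Because the prompt banks are the same objects in every sequence, updating them while training the sentence-level transformer would immediately change what the section-level transformer sees, which is how the shared prompts act as bridges between levels and modalities.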
Pages: 6376-6390
Page count: 15
Related Papers
50 records in total
  • [1] Multi-modal long document classification based on Hierarchical Prompt and Multi-modal Transformer
    Liu, Tengfei
    Hu, Yongli
    Gao, Junbin
    Wang, Jiapu
    Sun, Yanfeng
    Yin, Baocai
    NEURAL NETWORKS, 2024, 176
  • [2] HM-Transformer: Hierarchical Multi-modal Transformer for Long Document Image Understanding
    Deng, Xi
    Li, Shasha
    Yu, Jie
    Ma, Jun
    WEB AND BIG DATA, PT IV, APWEB-WAIM 2023, 2024, 14334 : 232 - 245
  • [3] Prompting for Multi-Modal Tracking
    Yang, Jinyu
    Li, Zhe
    Zheng, Feng
    Leonardis, Ales
    Song, Jingkuan
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 3492 - 3500
  • [4] Multi-Modal Knowledge Graph Transformer Framework for Multi-Modal Entity Alignment
    Li, Qian
    Ji, Cheng
    Guo, Shu
    Liang, Zhaoji
    Wang, Lihong
    Li, Jianxin
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 987 - 999
  • [5] A MULTI-MODAL TRANSFORMER APPROACH FOR FOOTBALL EVENT CLASSIFICATION
    Zhang, Yixiao
    Li, Baihua
    Fang, Hui
    Meng, Qinggang
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 2220 - 2224
  • [6] A Multi-Modal Multilingual Benchmark for Document Image Classification
    Fujinuma, Yoshinari
    Varia, Siddharth
    Sankaran, Nishant
    Min, Bonan
    Appalaraju, Srikar
    Vyas, Yogarshi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 14361 - 14376
  • [7] Multi-modal Extreme Classification
    Mittal, Anshul
    Dahiya, Kunal
    Malani, Shreya
    Ramaswamy, Janani
    Kuruvilla, Seba
    Ajmera, Jitendra
    Chang, Keng-Hao
    Agarwal, Sumeet
    Kar, Purushottam
    Varma, Manik
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12383 - 12392
  • [8] Landmark Classification With Hierarchical Multi-Modal Exemplar Feature
    Zhu, Lei
    Shen, Jialie
    Jin, Hai
    Xie, Liang
    Zheng, Ran
    IEEE TRANSACTIONS ON MULTIMEDIA, 2015, 17 (07) : 981 - 993
  • [9] Transformer enabled multi-modal medical diagnosis for tuberculosis classification
    Kumar, Sachin
    Sharma, Shivani
    Megra, Kassahun Tadesse
    JOURNAL OF BIG DATA, 2025, 12 (01)
  • [10] Multi-modal mask Transformer network for social event classification
    Chen, H.
    Qian, S.
    Li, Z.
    Fang, Q.
    Xu, C.
    Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics, 2024, 50 (02): : 579 - 587