Hierarchical Multi-Modal Prompting Transformer for Multi-Modal Long Document Classification

Cited by: 3

Authors:

Liu, Tengfei [1]

Hu, Yongli [1]

Gao, Junbin

Sun, Yanfeng

Yin, Baocai

Affiliations:

[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2024, Vol. 34, No. 7

Funding:

National Natural Science Foundation of China

Keywords:

Transformers; Task analysis; Feature extraction; Visualization; Circuits and systems; Adaptation models; Computational modeling; Multi-modal long document classification; multi-modal transformer; adaptive multi-scale multi-modal transformer; prompt learning;

DOI:

10.1109/TCSVT.2024.3366935

Chinese Library Classification (CLC) number:

TM [Electrical engineering technology]; TN [Electronic technology, communication technology];

Discipline classification code:

0808 ; 0809 ;

Abstract:

In the context of long document classification (LDC), effectively utilizing multi-modal information encompassing texts and images within these documents has not received adequate attention. This task showcases several notable characteristics. Firstly, the text possesses an implicit or explicit hierarchical structure consisting of sections, sentences, and words. Secondly, the distribution of images is dispersed, encompassing various types such as highly relevant topic images and loosely related reference images. Lastly, intricate and diverse relationships exist between images and text at different levels. To address these challenges, we propose a novel approach called Hierarchical Multi-modal Prompting Transformer (HMPT). Our proposed method constructs the uni-modal and multi-modal transformers at both the section and sentence levels, facilitating effective interaction between features. Notably, we design an adaptive multi-scale multi-modal transformer tailored to capture the multi-granularity correlations between sentences and images. Additionally, we introduce three different types of shared prompts, i.e., shared section, sentence, and image prompts, as bridges connecting the isolated transformers, enabling seamless information interaction across different levels and modalities. To validate the model performance, we conducted experiments on two newly created and two publicly available multi-modal long document datasets. The obtained results show that our method outperforms state-of-the-art single-modality and multi-modality classification methods.

Pages: 6376-6390

Page count: 15
