Hierarchical Multi-Modal Prompting Transformer for Multi-Modal Long Document Classification

Cited by: 3
Authors
Liu, Tengfei [1 ]
Hu, Yongli [1 ]
Gao, Junbin
Sun, Yanfeng
Yin, Baocai
Affiliations
[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Task analysis; Feature extraction; Visualization; Circuits and systems; Adaptation models; Computational modeling; Multi-modal long document classification; multi-modal transformer; adaptive multi-scale multi-modal transformer; prompt learning;
DOI
10.1109/TCSVT.2024.3366935
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Classification Codes
0808; 0809;
Abstract
In the context of long document classification (LDC), effectively utilizing multi-modal information encompassing texts and images within these documents has not received adequate attention. This task showcases several notable characteristics. Firstly, the text possesses an implicit or explicit hierarchical structure consisting of sections, sentences, and words. Secondly, the distribution of images is dispersed, encompassing various types such as highly relevant topic images and loosely related reference images. Lastly, intricate and diverse relationships exist between images and text at different levels. To address these challenges, we propose a novel approach called Hierarchical Multi-modal Prompting Transformer (HMPT). Our proposed method constructs the uni-modal and multi-modal transformers at both the section and sentence levels, facilitating effective interaction between features. Notably, we design an adaptive multi-scale multi-modal transformer tailored to capture the multi-granularity correlations between sentences and images. Additionally, we introduce three different types of shared prompts, i.e., shared section, sentence, and image prompts, as bridges connecting the isolated transformers, enabling seamless information interaction across different levels and modalities. To validate the model performance, we conducted experiments on two newly created and two publicly available multi-modal long document datasets. The obtained results show that our method outperforms state-of-the-art single-modality and multi-modality classification methods.
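The abstract's central mechanism is a set of shared prompts that bridge otherwise isolated transformers across levels and modalities. A minimal sketch of that idea in PyTorch follows; all class names, dimensions, and the two-level wiring are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the shared-prompt mechanism: learnable prompt
# vectors are prepended to each level's token sequence, so that separate
# transformer encoders can exchange information through the updated
# prompts. Names and sizes below are assumptions, not the HMPT code.

class PromptedEncoder(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_prompts=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.n_prompts = n_prompts

    def forward(self, tokens, shared_prompts):
        # tokens: (batch, seq, dim); shared_prompts: (batch, n_prompts, dim)
        x = torch.cat([shared_prompts, tokens], dim=1)
        x = self.encoder(x)
        # split the updated prompts back out so they can carry information
        # to the encoder at the next level (or the other modality)
        return x[:, self.n_prompts:], x[:, :self.n_prompts]

dim, n_prompts, batch = 256, 4, 2
# one shared prompt bank, reused by both levels
prompts = nn.Parameter(torch.randn(1, n_prompts, dim))

sent_enc = PromptedEncoder(dim, n_prompts=n_prompts)  # sentence level
sect_enc = PromptedEncoder(dim, n_prompts=n_prompts)  # section level

sent_tokens = torch.randn(batch, 10, dim)  # toy sentence features
sect_tokens = torch.randn(batch, 5, dim)   # toy section features

p = prompts.expand(batch, -1, -1)
sent_out, p = sent_enc(sent_tokens, p)  # prompts updated at sentence level
sect_out, p = sect_enc(sect_tokens, p)  # then passed to the section level
```

The key design point the abstract emphasizes is that the prompts are the only parameters shared between levels, so gradients flowing through them let the per-level transformers interact without merging their token streams.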
Pages: 6376-6390
Page count: 15
Related Papers
50 items total
  • [21] Knowledge Synergy Learning for Multi-Modal Tracking
    He, Yuhang
    Ma, Zhiheng
    Wei, Xing
    Gong, Yihong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 5519 - 5532
  • [22] Multi-Modal Multi-Grained Embedding Learning for Generalized Zero-Shot Video Classification
    Hong, Mingyao
    Zhang, Xinfeng
    Li, Guorong
    Huang, Qingming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5959 - 5972
  • [23] Metaknowledge Extraction Based on Multi-Modal Documents
    Liu, Shu-Kan
    Xu, Rui-Lin
    Geng, Bo-Ying
    Sun, Qiao
    Duan, Li
    Liu, Yi-Ming
    IEEE ACCESS, 2021, 9 : 50050 - 50060
  • [24] A Multi-Modal ELMo Model for Image Sentiment Recognition of Consumer Data
    Rong, Lu
    Ding, Yijie
    Wang, Mengyao
    El Saddik, Abdulmotaleb
    Hossain, M. Shamim
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2024, 70 (01) : 3697 - 3708
  • [25] Multi-Modal Multi-Channel Target Speech Separation
    Gu, Rongzhi
    Zhang, Shi-Xiong
    Xu, Yong
    Chen, Lianwu
    Zou, Yuexian
    Yu, Dong
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2020, 14 (03) : 530 - 541
  • [26] Depth for Multi-Modal Contour Ensembles
    Chaves-de-Plaza, N. F.
    Molenaar, M.
    Mody, P.
    Staring, M.
    van Egmond, R.
    Eisemann, E.
    Vilanova, A.
    Hildebrandt, K.
    COMPUTER GRAPHICS FORUM, 2024, 43 (03)
  • [27] Multi-Modal Hierarchical Empathetic Framework for Social Robots With Affective Body Control
    Gao, Yue
    Fu, Yangqing
    Sun, Ming
    Gao, Feng
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2024, 15 (03) : 1621 - 1633
  • [28] Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing
    Siebert, Tim
    Clasen, Kai Norman
    Ravanbakhsh, Mahdyar
    Demir, Beguem
    IMAGE AND SIGNAL PROCESSING FOR REMOTE SENSING XXVIII, 2022, 12267
  • [29] MTCAM: A Novel Weakly-Supervised Audio-Visual Saliency Prediction Model With Multi-Modal Transformer
    Zhu, Dandan
    Zhu, Kun
    Ding, Weiping
    Zhang, Nana
    Min, Xiongkuo
    Zhai, Guangtao
    Yang, Xiaokang
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 8 (02): : 1756 - 1771
  • [30] MATNet: Exploiting Multi-Modal Features for Radiology Report Generation
    Shang, Caozhi
    Cui, Shaoguo
    Li, Tiansong
    Wang, Xi
    Li, Yongmei
    Jiang, Jingfeng
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2692 - 2696