Image-text multimodal classification via cross-attention contextual transformer with modality-collaborative learning

Cited by: 0
Authors
Shi, Qianyao [1 ]
Xu, Wanru
Miao, Zhenjiang [2 ]
Affiliations
[1] Beijing Jiaotong Univ, Informat & Commun Engn, Beijing, Peoples R China
[2] Beijing Jiaotong Univ, Media Comp Ctr, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation;
Keywords
multimodal classification; cross-attention; contextual transformer; modality-collaborative;
DOI
10.1117/1.JEI.33.4.043042
CLC Classification Number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject Classification Codes
0808; 0809;
Abstract
Nowadays, we are surrounded by data from many modalities, such as text, images, audio, and video. This multimodal data carries rich information, but it also raises a new challenge: how can it be exploited effectively for accurate classification? Multimodal classification aims to classify data drawn from different modalities; however, because modalities differ in their characteristics and structure, fusing them effectively for classification remains difficult. To address this issue, we propose a cross-attention contextual transformer with modality-collaborative learning for multimodal classification (CACT-MCL-MMC) that better integrates information from different modalities. On the one hand, existing multimodal fusion methods ignore intra- and inter-modality relationships, leaving information within the modalities unexploited and yielding unsatisfactory classification performance. To address this insufficient interaction of modality information, we use a cross-attention contextual transformer to capture contextual relationships within and across modalities and thereby improve the representativeness of the model. On the other hand, because modalities differ in information quality, some may carry misleading or ambiguous content; treating every modality equally introduces modality perceptual noise and degrades multimodal classification performance. We therefore use modality-collaborative learning to filter misleading information, alleviate quality differences across modalities, align modality information with high-quality and effective modalities, enhance unimodal representations, and obtain better-fused multimodal information, improving the model's discriminative ability. Comparative experiments on two image-text classification benchmarks, CrisisMMD and UPMC Food-101, show that the proposed model outperforms other classification methods, including state-of-the-art (SOTA) multimodal classification methods. Ablation experiments verify the effectiveness of the cross-attention module, the multimodal contextual attention network, and modality-collaborative learning. Hyper-parameter validation experiments show that different fusion calculation methods yield different results, and we identify the most effective feature-tensor calculation method. Qualitative experiments show that, compared with the baseline model, the proposed model produces the expected results in the vast majority of cases. The code is available at https://github.com/KobeBryant8-24-MVP/CACT-MCL-MMC. CrisisMMD is available at https://dataverse.mpisws.org/dataverse/icwsm18, and UPMC Food-101 is available at https://visiir.isir.upmc.fr/. (c) 2024 SPIE and IS&T
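To make the cross-attention fusion idea from the abstract concrete, below is a minimal PyTorch sketch of bidirectional image-text cross-attention followed by a joint classifier. It is an illustration only, not the authors' CACT-MCL-MMC implementation (their released code is at the GitHub link above); the module name CrossAttentionFusion, the feature dimensions, the class count, and the assumption of pre-extracted text-token and image-patch features are hypothetical choices for the sketch.

```python
# Minimal sketch (not the authors' released code) of image-text cross-attention
# fusion for classification, assuming PyTorch and pre-extracted unimodal token
# features (e.g., BERT text tokens and CNN/ViT image patch embeddings).
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8, num_classes=2):
        super().__init__()
        # Queries come from one modality; keys/values from the other,
        # so each modality attends to contextual cues in its counterpart.
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_i = nn.LayerNorm(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, text_tokens, image_tokens):
        # text_tokens: (B, Lt, dim); image_tokens: (B, Li, dim)
        t_ctx, _ = self.txt2img(text_tokens, image_tokens, image_tokens)
        i_ctx, _ = self.img2txt(image_tokens, text_tokens, text_tokens)
        # Residual connection + layer norm, then mean-pool each modality.
        t = self.norm_t(text_tokens + t_ctx).mean(dim=1)
        i = self.norm_i(image_tokens + i_ctx).mean(dim=1)
        # Concatenate the two modality views and classify the fused vector.
        return self.classifier(torch.cat([t, i], dim=-1))


# Example: fuse a batch of 4 text sequences (32 tokens) with 4 images (49 patches).
logits = CrossAttentionFusion()(torch.randn(4, 32, 768), torch.randn(4, 49, 768))
print(logits.shape)  # torch.Size([4, 2])
```

The paper's full model additionally models intra-modality context and weights modalities by quality via modality-collaborative learning; this sketch covers only the basic cross-modal attention and late concatenation step.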
Pages: 23