Multi-Modal Representation Learning with Text-Driven Soft Masks

Cited by: 3
Authors
Park, Jaeyoo [1]
Han, Bohyung [1,2]
Affiliations
[1] Seoul Natl Univ, Comp Vis Lab, ECE, Seoul, South Korea
[2] Seoul Natl Univ, IPAI, Seoul, South Korea
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023
DOI
10.1109/CVPR52729.2023.00274
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the regions in an image that are most relevant to a certain word in the corresponding caption, instead of removing them completely. Since our framework relies only on image-caption pairs with no fine-grained annotations, we identify the regions relevant to each word by computing word-conditional visual attention with the multi-modal encoder. Second, we encourage the model to focus on hard but diverse examples by introducing a focal loss for the image-text contrastive (ITC) objective, which alleviates inherent overfitting and bias issues. Last, we perform multi-modal data augmentation for self-supervised learning by mining diverse examples: masking words in the captions and applying distortions to the images. We show that the combination of these three innovations is effective for pretraining, leading to outstanding performance on multiple vision-language downstream tasks.
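
As a rough illustration of the first idea, the sketch below turns word-conditional cross-attention from a multi-modal encoder into a soft mask that attenuates, rather than removes, the image regions most relevant to a sampled caption word. The tensor layout, the sampling of one word per caption, and the minimum-keep ratio tau are assumptions for illustration, not the paper's exact formulation.

import torch

def soft_mask_image_features(patch_feats, cross_attn, word_idx, tau=0.1):
    # patch_feats: (B, N, D) image patch features from the vision encoder
    # cross_attn:  (B, T, N) word-to-patch attention from the multi-modal encoder
    # word_idx:    (B,) index of one sampled caption word per example
    B = patch_feats.size(0)
    attn = cross_attn[torch.arange(B), word_idx]          # (B, N)
    # min-max normalize so the most-attended patch gets weight 1
    lo = attn.amin(dim=1, keepdim=True)
    hi = attn.amax(dim=1, keepdim=True)
    attn = (attn - lo) / (hi - lo + 1e-6)
    # soft mask: the most relevant regions are scaled down toward tau,
    # never removed completely
    keep = 1.0 - (1.0 - tau) * attn                       # (B, N), in [tau, 1]
    return patch_feats * keep.unsqueeze(-1)               # (B, N, D)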
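
The second idea admits a similarly compact sketch: a symmetric InfoNCE-style image-text contrastive loss whose per-pair term is modulated by a (1 - p)^gamma focal factor, so easy (high-probability) pairs contribute less and hard ones dominate the gradient. The temperature, gamma, and the exact placement of the focal factor are assumptions; the paper's weighting may differ.

import torch
import torch.nn.functional as F

def focal_itc_loss(img_emb, txt_emb, temp=0.07, gamma=2.0):
    # img_emb, txt_emb: (B, D) L2-normalized embeddings; matched pairs share an index
    sim = img_emb @ txt_emb.t() / temp                    # (B, B) similarity logits

    def focal_ce(logits):
        logp_pos = F.log_softmax(logits, dim=1).diagonal()  # log-prob of matched pairs
        p = logp_pos.exp()
        # (1 - p)^gamma down-weights easy pairs, focusing the loss on hard ones
        return -(((1.0 - p) ** gamma) * logp_pos).mean()

    # symmetric over image-to-text and text-to-image directions
    return 0.5 * (focal_ce(sim) + focal_ce(sim.t()))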
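
The third component, multi-modal data augmentation, could look roughly like the following: random photometric and geometric distortions on the image side paired with random word masking on the caption side. The specific transforms, probabilities, and mask token are illustrative choices, not taken from the paper.

import random
import torchvision.transforms as T

# hypothetical image distortion pipeline
image_aug = T.Compose([
    T.RandomResizedCrop(224, scale=(0.5, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.ToTensor(),
])

def mask_caption(caption, mask_token="[MASK]", p=0.15):
    # replace each word with the mask token with probability p
    words = caption.split()
    return " ".join(mask_token if random.random() < p else w for w in words)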
Pages: 2798-2807
Page count: 10
Related Papers
50 items in total
  • [41] Liu, Yanbei; Fan, Lianxi; Zhang, Changqing; Zhou, Tao; Xiao, Zhitao; Geng, Lei; Shen, Dinggang. Incomplete multi-modal representation learning for Alzheimer's disease diagnosis. MEDICAL IMAGE ANALYSIS, 2021, 69.
  • [42] Zhou, Tao; Liu, Mingxia; Fu, Huazhu; Wang, Jun; Shen, Jianbing; Shao, Ling; Shen, Dinggang. Deep Multi-modal Latent Representation Learning for Automated Dementia Diagnosis. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2019, PT IV, 2019, 11767: 629-638.
  • [43] Chen, Jiafu; Ji, Boyan; Zhang, Zhanjie; Chu, Tianyi; Zuo, Zhiwen; Zhao, Lei; Xing, Wei; Lu, Dongming. TeSTNeRF: Text-Driven 3D Style Transfer via Cross-Modal Learning. PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023: 5788-5796.
  • [44] Liang, Anqi; Yao, Bin; Xie, Jiong; Zheng, Wenli; Shen, Yanyan; Ge, Qiqi. CLMTR: a generic framework for contrastive multi-modal trajectory representation learning. GEOINFORMATICA, 2024: 233-253.
  • [45] Taalimi, Ali; Qi, Hairong; Khorsandi, Rahman. Online Multi-modal Task-Driven Dictionary Learning and Robust Joint Sparse Representation for Visual Tracking. 2015 12TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), 2015.
  • [46] Iqbal, Mohammed Shameer. Unsupervised Multi-modal Learning. ADVANCES IN ARTIFICIAL INTELLIGENCE (AI 2015), 2015, 9091: 343-346.
  • [47] McFee, Brian; Lanckriet, Gert. Learning Multi-modal Similarity. JOURNAL OF MACHINE LEARNING RESEARCH, 2011, 12: 491-523.
  • [48] Cruz, Francisco; Parisi, German I.; Wermter, Stefan. Multi-modal Feedback for Affordance-driven Interactive Reinforcement Learning. 2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018.
  • [49] Yao, Wenfang; Yin, Kejing; Cheung, William K.; Liu, Jia; Qin, Jing. DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 15, 2024: 16416-16424.
  • [50] Li, Yiming; Zhou, Peng; Sun, Jun; Xu, Yi. Multi-Region Text-Driven Manipulation of Diffusion Imagery. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 4, 2024: 3261-3269.