Multi-Modal Representation Learning with Text-Driven Soft Masks

Cited by: 3
Authors: Park, Jaeyoo [1]; Han, Bohyung [1,2]
Affiliations: [1] Seoul Natl Univ, Comp Vis Lab, ECE, Seoul, South Korea; [2] Seoul Natl Univ, IPAI, Seoul, South Korea
Source: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
DOI: 10.1109/CVPR52729.2023.00274
CLC number: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the image regions most relevant to a given word in the corresponding caption, instead of removing them completely. Since our framework relies only on image-caption pairs without fine-grained annotations, we identify the regions relevant to each word by computing word-conditional visual attention with the multi-modal encoder. Second, we encourage the model to focus on hard but diverse examples by proposing a focal loss for the image-text contrastive learning (ITC) objective, which alleviates the inherent overfitting and bias issues. Last, we perform multi-modal data augmentation for self-supervised learning by mining diverse examples: masking words in the text and applying distortions to the image. We show that the combination of these three innovations is effective for learning a pretrained model, leading to outstanding performance on multiple vision-language downstream tasks.
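To make the abstract's two core ideas concrete, below is a minimal PyTorch sketch of (a) word-conditional soft masking and (b) a focal variant of the ITC loss. This is an illustration under stated assumptions, not the authors' implementation: the function names, the max-normalization of the attention map, the temperature tau, and the focal exponent gamma are all hypothetical choices.

```python
import torch
import torch.nn.functional as F

def soft_mask_regions(image_feats, word_attn):
    """(a) Soft-mask, rather than zero out, the image regions most
    relevant to a caption word.

    image_feats: (B, R, D) region/patch features.
    word_attn:   (B, R) word-conditional cross-attention weights over
                 regions, read off the multi-modal encoder (so no
                 fine-grained box annotations are needed).
    """
    # Normalize so the most-attended region gets weight 1 (assumption).
    attn = word_attn / (word_attn.amax(dim=-1, keepdim=True) + 1e-6)
    # Attenuate the relevant regions instead of removing them completely.
    return image_feats * (1.0 - attn).unsqueeze(-1)

def focal_itc_loss(img_emb, txt_emb, tau=0.07, gamma=2.0):
    """(b) Focal weighting on an InfoNCE-style ITC loss: pairs the model
    already matches confidently (large p) are down-weighted by
    (1 - p)^gamma, shifting focus to hard but diverse examples."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau  # (B, B) pairwise similarities

    def one_direction(lg):
        p = lg.softmax(dim=-1).diagonal()  # prob. of the matched pair
        return ((1.0 - p) ** gamma * -(p.clamp_min(1e-6).log())).mean()

    # Symmetric image-to-text and text-to-image terms.
    return 0.5 * (one_direction(logits) + one_direction(logits.t()))
```

With gamma = 0 the focal term vanishes and focal_itc_loss reduces to the standard symmetric ITC loss, which makes the focal weighting easy to ablate.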
Pages: 2798-2807 (10 pages)