Multi-Modal Representation Learning with Text-Driven Soft Masks

Cited: 3
Authors:
Park, Jaeyoo [1]
Han, Bohyung [1,2]
Affiliations:
[1] Seoul Natl Univ, Comp Vis Lab, ECE, Seoul, South Korea
[2] Seoul Natl Univ, IPAI, Seoul, South Korea
Source:
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023
DOI: 10.1109/CVPR52729.2023.00274
CLC classification: TP18 [Artificial intelligence theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss function, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the regions of an image that are most relevant to a particular word in the corresponding caption, instead of removing them completely. Since our framework relies only on image-caption pairs without fine-grained annotations, we identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder. Second, we encourage the model to focus on hard yet diverse examples by proposing a focal loss for the image-text contrastive learning (ITC) objective, which alleviates overfitting and bias issues. Last, we perform multi-modal data augmentation for self-supervised learning by mining varied examples, masking words in the text and applying distortions to the images. We show that the combination of these three innovations is effective for pretraining, leading to outstanding performance on multiple vision-language downstream tasks.
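To make the soft-masking idea concrete, the following is a minimal, hypothetical PyTorch sketch. It assumes a word-to-patch cross-attention map is already available from a multi-modal encoder (simulated here with random tensors); the function name soft_mask_patches, the temperature value, and the max-normalization are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of text-driven soft masking (not the authors' code).
# Assumption: a multi-modal encoder already provides word-to-patch
# cross-attention; random tensors stand in for its outputs below.
import torch
import torch.nn.functional as F


def soft_mask_patches(patch_feats, cross_attn, word_idx, temperature=0.1):
    """Attenuate (rather than remove) image patches most relevant to one caption word.

    patch_feats: (B, P, D) image patch features
    cross_attn:  (B, T, P) word-to-patch attention weights
    word_idx:    index of the caption word that drives the mask
    """
    # Relevance of each patch to the selected word, normalized over patches.
    relevance = F.softmax(cross_attn[:, word_idx, :] / temperature, dim=-1)   # (B, P)
    # Soft mask in [0, 1]: the most relevant patch is down-weighted the most,
    # but no patch is zeroed out completely.
    soft_mask = 1.0 - relevance / relevance.max(dim=-1, keepdim=True).values  # (B, P)
    return patch_feats * soft_mask.unsqueeze(-1)                              # (B, P, D)


# Toy usage with random tensors standing in for encoder outputs.
B, T, P, D = 2, 12, 196, 256          # batch, caption length, patches, feature dim
patch_feats = torch.randn(B, P, D)
cross_attn = torch.rand(B, T, P)
masked_feats = soft_mask_patches(patch_feats, cross_attn, word_idx=3)
print(masked_feats.shape)             # torch.Size([2, 196, 256])
```

In the paper, such attenuated visual features would serve as harder examples for the ITM objective; the exact attention source and mask normalization in the published method may differ from this sketch.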
Pages: 2798 - 2807
Number of pages: 10
Related papers (50 in total)
  • [21] Multi-Modal Transportation Recommendation with Unified Route Representation Learning
    Liu, Hao
    Han, Jindong
    Fu, Yanjie
    Zhou, Jingbo
    Lu, Xinjiang
    Xiong, Hui
PROCEEDINGS OF THE VLDB ENDOWMENT, 2020, 14 (03): 342 - 350
  • [22] Graph Embedding Contrastive Multi-Modal Representation Learning for Clustering
    Xia, Wei
    Wang, Tianxiu
    Gao, Quanxue
    Yang, Ming
    Gao, Xinbo
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 1170 - 1183
  • [23] MULTI-MODAL REPRESENTATION LEARNING FOR SHORT VIDEO UNDERSTANDING AND RECOMMENDATION
    Guo, Daya
    Hong, Jiangshui
    Luo, Binli
    Yan, Qirui
    Niu, Zhangming
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2019, : 687 - 690
  • [24] Towards a systematic multi-modal representation learning for network data
    Ben Houidi, Zied
    Azorin, Raphael
    Gallo, Massimo
    Finamore, Alessandro
    Rossi, Dario
    THE 21ST ACM WORKSHOP ON HOT TOPICS IN NETWORKS, HOTNETS 2022, 2022, : 181 - 187
  • [25] Multi-modal Representation Learning for Video Advertisement Content Structuring
    Guo, Daya
    Zeng, Zhaoyang
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 4770 - 4774
  • [26] Efficient disentangled representation learning for multi-modal finger biometrics
    Yang, Weili
    Huang, Junduan
    Luo, Dacan
    Kang, Wenxiong
    PATTERN RECOGNITION, 2024, 145
  • [27] Learning Multi-Modal Word Representation Grounded in Visual Context
    Zablocki, Eloi
    Piwowarski, Benjamin
    Soulier, Laure
    Gallinari, Patrick
    THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 5626 - 5633
  • [28] Multi-modal anchor adaptation learning for multi-modal summarization
    Chen, Zhongfeng
    Lu, Zhenyu
    Rong, Huan
    Zhao, Chuanjun
    Xu, Fan
    NEUROCOMPUTING, 2024, 570
  • [29] SSDMM-VAE: variational multi-modal disentangled representation learning
    Mondal, Arnab Kumar
    Sailopal, Ajay
    Singla, Parag
    Ap, Prathosh
    APPLIED INTELLIGENCE, 2023, 53 (07) : 8467 - 8481
  • [30] MMEarth: Exploring Multi-modal Pretext Tasks for Geospatial Representation Learning
    Nedungadi, Vishal
    Kariryaa, Ankit
    Oehmcke, Stefan
    Belongie, Serge
    Igel, Christian
    Lang, Nico
    COMPUTER VISION - ECCV 2024, PT LXIV, 2025, 15122 : 164 - 182