Multi-Modal Representation Learning with Text-Driven Soft Masks

Cited: 3
Authors:
Park, Jaeyoo [1]
Han, Bohyung [1,2]
Affiliations:
[1] Seoul Natl Univ, Comp Vis Lab, ECE, Seoul, South Korea
[2] Seoul Natl Univ, IPAI, Seoul, South Korea
Source:
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023
DOI: 10.1109/CVPR52729.2023.00274
CLC classification: TP18 [Artificial intelligence theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss function, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the regions of an image that are most relevant to a particular word in the corresponding caption, instead of removing them completely. Since our framework relies only on image-caption pairs without fine-grained annotations, we identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder. Second, we encourage the model to focus on hard yet diverse examples by proposing a focal loss for the image-text contrastive learning (ITC) objective, which alleviates overfitting and bias issues. Last, we perform multi-modal data augmentation for self-supervised learning by mining varied examples, masking words in the text and applying distortions to the images. We show that the combination of these three innovations is effective for pretraining, leading to outstanding performance on multiple vision-language downstream tasks.
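To make the soft-masking idea concrete, the following is a minimal, hypothetical PyTorch sketch. It assumes a word-to-patch cross-attention map is already available from a multi-modal encoder (simulated here with random tensors); the function name soft_mask_patches, the temperature value, and the max-normalization are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of text-driven soft masking (not the authors' code).
# Assumption: a multi-modal encoder already provides word-to-patch
# cross-attention; random tensors stand in for its outputs below.
import torch
import torch.nn.functional as F


def soft_mask_patches(patch_feats, cross_attn, word_idx, temperature=0.1):
    """Attenuate (rather than remove) image patches most relevant to one caption word.

    patch_feats: (B, P, D) image patch features
    cross_attn:  (B, T, P) word-to-patch attention weights
    word_idx:    index of the caption word that drives the mask
    """
    # Relevance of each patch to the selected word, normalized over patches.
    relevance = F.softmax(cross_attn[:, word_idx, :] / temperature, dim=-1)   # (B, P)
    # Soft mask in [0, 1]: the most relevant patch is down-weighted the most,
    # but no patch is zeroed out completely.
    soft_mask = 1.0 - relevance / relevance.max(dim=-1, keepdim=True).values  # (B, P)
    return patch_feats * soft_mask.unsqueeze(-1)                              # (B, P, D)


# Toy usage with random tensors standing in for encoder outputs.
B, T, P, D = 2, 12, 196, 256          # batch, caption length, patches, feature dim
patch_feats = torch.randn(B, P, D)
cross_attn = torch.rand(B, T, P)
masked_feats = soft_mask_patches(patch_feats, cross_attn, word_idx=3)
print(masked_feats.shape)             # torch.Size([2, 196, 256])
```

In the paper, such attenuated visual features would serve as harder examples for the ITM objective; the exact attention source and mask normalization in the published method may differ from this sketch.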
Pages: 2798 - 2807
Number of pages: 10
Related papers (50 in total)
  • [21] Multi-Modal Transportation Recommendation with Unified Route Representation Learning
    Liu, Hao
    Han, Jindong
    Fu, Yanjie
    Zhou, Jingbo
    Lu, Xinjiang
    Xiong, Hui
PROCEEDINGS OF THE VLDB ENDOWMENT, 2020, 14 (03): 342 - 350
  • [22] Graph Embedding Contrastive Multi-Modal Representation Learning for Clustering
    Xia, Wei
    Wang, Tianxiu
    Gao, Quanxue
    Yang, Ming
    Gao, Xinbo
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 1170 - 1183
  • [23] MULTI-MODAL REPRESENTATION LEARNING FOR SHORT VIDEO UNDERSTANDING AND RECOMMENDATION
    Guo, Daya
    Hong, Jiangshui
    Luo, Binli
    Yan, Qirui
    Niu, Zhangming
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2019, : 687 - 690
  • [24] Towards a systematic multi-modal representation learning for network data
    Ben Houidi, Zied
    Azorin, Raphael
    Gallo, Massimo
    Finamore, Alessandro
    Rossi, Dario
    THE 21ST ACM WORKSHOP ON HOT TOPICS IN NETWORKS, HOTNETS 2022, 2022, : 181 - 187
  • [25] Multi-modal Representation Learning for Video Advertisement Content Structuring
    Guo, Daya
    Zeng, Zhaoyang
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 4770 - 4774
  • [26] Efficient disentangled representation learning for multi-modal finger biometrics
    Yang, Weili
    Huang, Junduan
    Luo, Dacan
    Kang, Wenxiong
    PATTERN RECOGNITION, 2024, 145
  • [27] Learning Multi-Modal Word Representation Grounded in Visual Context
    Zablocki, Eloi
    Piwowarski, Benjamin
    Soulier, Laure
    Gallinari, Patrick
    THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 5626 - 5633
  • [28] Multi-modal anchor adaptation learning for multi-modal summarization
    Chen, Zhongfeng
    Lu, Zhenyu
    Rong, Huan
    Zhao, Chuanjun
    Xu, Fan
    NEUROCOMPUTING, 2024, 570
  • [29] SSDMM-VAE: variational multi-modal disentangled representation learning
    Mondal, Arnab Kumar
    Sailopal, Ajay
    Singla, Parag
    Ap, Prathosh
    APPLIED INTELLIGENCE, 2023, 53 (07) : 8467 - 8481
  • [30] MMEarth: Exploring Multi-modal Pretext Tasks for Geospatial Representation Learning
    Nedungadi, Vishal
    Kariryaa, Ankit
    Oehmcke, Stefan
    Belongie, Serge
    Igel, Christian
    Lang, Nico
    COMPUTER VISION - ECCV 2024, PT LXIV, 2025, 15122 : 164 - 182