Multi-Modal Representation Learning with Text-Driven Soft Masks

Cited by: 3
Authors
Park, Jaeyoo [1]
Han, Bohyung [1,2]
Affiliations
[1] Seoul Natl Univ, Comp Vis Lab, ECE, Seoul, South Korea
[2] Seoul Natl Univ, IPAI, Seoul, South Korea
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023
DOI
10.1109/CVPR52729.2023.00274
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the regions in an image that are most relevant to a certain word in the corresponding caption, instead of removing them completely. Since our framework relies only on image-caption pairs with no fine-grained annotations, we identify the regions relevant to each word by computing word-conditional visual attention with the multi-modal encoder. Second, we encourage the model to focus on hard but diverse examples by introducing a focal loss for the image-text contrastive (ITC) objective, which alleviates inherent overfitting and bias issues. Last, we perform multi-modal data augmentation for self-supervised learning by mining diverse examples: masking words in the captions and applying distortions to the images. We show that the combination of these three innovations is effective for pretraining, leading to outstanding performance on multiple vision-language downstream tasks.
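
As a rough illustration of the first idea, the sketch below turns word-conditional cross-attention from a multi-modal encoder into a soft mask that attenuates, rather than removes, the image regions most relevant to a sampled caption word. The tensor layout, the sampling of one word per caption, and the minimum-keep ratio tau are assumptions for illustration, not the paper's exact formulation.

import torch

def soft_mask_image_features(patch_feats, cross_attn, word_idx, tau=0.1):
    # patch_feats: (B, N, D) image patch features from the vision encoder
    # cross_attn:  (B, T, N) word-to-patch attention from the multi-modal encoder
    # word_idx:    (B,) index of one sampled caption word per example
    B = patch_feats.size(0)
    attn = cross_attn[torch.arange(B), word_idx]          # (B, N)
    # min-max normalize so the most-attended patch gets weight 1
    lo = attn.amin(dim=1, keepdim=True)
    hi = attn.amax(dim=1, keepdim=True)
    attn = (attn - lo) / (hi - lo + 1e-6)
    # soft mask: the most relevant regions are scaled down toward tau,
    # never removed completely
    keep = 1.0 - (1.0 - tau) * attn                       # (B, N), in [tau, 1]
    return patch_feats * keep.unsqueeze(-1)               # (B, N, D)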
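
The second idea admits a similarly compact sketch: a symmetric InfoNCE-style image-text contrastive loss whose per-pair term is modulated by a (1 - p)^gamma focal factor, so easy (high-probability) pairs contribute less and hard ones dominate the gradient. The temperature, gamma, and the exact placement of the focal factor are assumptions; the paper's weighting may differ.

import torch
import torch.nn.functional as F

def focal_itc_loss(img_emb, txt_emb, temp=0.07, gamma=2.0):
    # img_emb, txt_emb: (B, D) L2-normalized embeddings; matched pairs share an index
    sim = img_emb @ txt_emb.t() / temp                    # (B, B) similarity logits

    def focal_ce(logits):
        logp_pos = F.log_softmax(logits, dim=1).diagonal()  # log-prob of matched pairs
        p = logp_pos.exp()
        # (1 - p)^gamma down-weights easy pairs, focusing the loss on hard ones
        return -(((1.0 - p) ** gamma) * logp_pos).mean()

    # symmetric over image-to-text and text-to-image directions
    return 0.5 * (focal_ce(sim) + focal_ce(sim.t()))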
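
The third component, multi-modal data augmentation, could look roughly like the following: random photometric and geometric distortions on the image side paired with random word masking on the caption side. The specific transforms, probabilities, and mask token are illustrative choices, not taken from the paper.

import random
import torchvision.transforms as T

# hypothetical image distortion pipeline
image_aug = T.Compose([
    T.RandomResizedCrop(224, scale=(0.5, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.ToTensor(),
])

def mask_caption(caption, mask_token="[MASK]", p=0.15):
    # replace each word with the mask token with probability p
    words = caption.split()
    return " ".join(mask_token if random.random() < p else w for w in words)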
Pages: 2798-2807
Page count: 10
Related Papers
50 items in total
  • [41] Liu, Yanbei; Fan, Lianxi; Zhang, Changqing; Zhou, Tao; Xiao, Zhitao; Geng, Lei; Shen, Dinggang. Incomplete multi-modal representation learning for Alzheimer's disease diagnosis. MEDICAL IMAGE ANALYSIS, 2021, 69.
  • [42] Zhou, Tao; Liu, Mingxia; Fu, Huazhu; Wang, Jun; Shen, Jianbing; Shao, Ling; Shen, Dinggang. Deep Multi-modal Latent Representation Learning for Automated Dementia Diagnosis. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2019, PT IV, 2019, 11767: 629-638.
  • [43] Chen, Jiafu; Ji, Boyan; Zhang, Zhanjie; Chu, Tianyi; Zuo, Zhiwen; Zhao, Lei; Xing, Wei; Lu, Dongming. TeSTNeRF: Text-Driven 3D Style Transfer via Cross-Modal Learning. PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023: 5788-5796.
  • [44] Liang, Anqi; Yao, Bin; Xie, Jiong; Zheng, Wenli; Shen, Yanyan; Ge, Qiqi. CLMTR: a generic framework for contrastive multi-modal trajectory representation learning. GEOINFORMATICA, 2024: 233-253.
  • [45] Taalimi, Ali; Qi, Hairong; Khorsandi, Rahman. Online Multi-modal Task-Driven Dictionary Learning and Robust Joint Sparse Representation for Visual Tracking. 2015 12TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), 2015.
  • [46] Iqbal, Mohammed Shameer. Unsupervised Multi-modal Learning. ADVANCES IN ARTIFICIAL INTELLIGENCE (AI 2015), 2015, 9091: 343-346.
  • [47] McFee, Brian; Lanckriet, Gert. Learning Multi-modal Similarity. JOURNAL OF MACHINE LEARNING RESEARCH, 2011, 12: 491-523.
  • [48] Cruz, Francisco; Parisi, German I.; Wermter, Stefan. Multi-modal Feedback for Affordance-driven Interactive Reinforcement Learning. 2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018.
  • [49] Yao, Wenfang; Yin, Kejing; Cheung, William K.; Liu, Jia; Qin, Jing. DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 15, 2024: 16416-16424.
  • [50] Li, Yiming; Zhou, Peng; Sun, Jun; Xu, Yi. Multi-Region Text-Driven Manipulation of Diffusion Imagery. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 4, 2024: 3261-3269.