Multi-Modal Representation Learning with Text-Driven Soft Masks

Cited by: 3
Authors: Park, Jaeyoo [1]; Han, Bohyung [1,2]
Affiliations: [1] Seoul Natl Univ, Comp Vis Lab, ECE, Seoul, South Korea; [2] Seoul Natl Univ, IPAI, Seoul, South Korea
Source: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
DOI: 10.1109/CVPR52729.2023.00274
CLC number: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the image regions most relevant to a given word in the corresponding caption, instead of removing them completely. Since our framework relies only on image-caption pairs without fine-grained annotations, we identify the regions relevant to each word by computing word-conditional visual attention with the multi-modal encoder. Second, we encourage the model to focus on hard but diverse examples by proposing a focal loss for the image-text contrastive learning (ITC) objective, which alleviates the inherent overfitting and bias issues. Last, we perform multi-modal data augmentation for self-supervised learning by mining diverse examples: masking words in the text and applying distortions to the image. We show that the combination of these three innovations is effective for learning a pretrained model, leading to outstanding performance on multiple vision-language downstream tasks.
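To make the abstract's two core ideas concrete, below is a minimal PyTorch sketch of (a) word-conditional soft masking and (b) a focal variant of the ITC loss. This is an illustration under stated assumptions, not the authors' implementation: the function names, the max-normalization of the attention map, the temperature tau, and the focal exponent gamma are all hypothetical choices.

```python
import torch
import torch.nn.functional as F

def soft_mask_regions(image_feats, word_attn):
    """(a) Soft-mask, rather than zero out, the image regions most
    relevant to a caption word.

    image_feats: (B, R, D) region/patch features.
    word_attn:   (B, R) word-conditional cross-attention weights over
                 regions, read off the multi-modal encoder (so no
                 fine-grained box annotations are needed).
    """
    # Normalize so the most-attended region gets weight 1 (assumption).
    attn = word_attn / (word_attn.amax(dim=-1, keepdim=True) + 1e-6)
    # Attenuate the relevant regions instead of removing them completely.
    return image_feats * (1.0 - attn).unsqueeze(-1)

def focal_itc_loss(img_emb, txt_emb, tau=0.07, gamma=2.0):
    """(b) Focal weighting on an InfoNCE-style ITC loss: pairs the model
    already matches confidently (large p) are down-weighted by
    (1 - p)^gamma, shifting focus to hard but diverse examples."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau  # (B, B) pairwise similarities

    def one_direction(lg):
        p = lg.softmax(dim=-1).diagonal()  # prob. of the matched pair
        return ((1.0 - p) ** gamma * -(p.clamp_min(1e-6).log())).mean()

    # Symmetric image-to-text and text-to-image terms.
    return 0.5 * (one_direction(logits) + one_direction(logits.t()))
```

With gamma = 0 the focal term vanishes and focal_itc_loss reduces to the standard symmetric ITC loss, which makes the focal weighting easy to ablate.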
Pages: 2798-2807 (10 pages)