Synthesizing Counterfactual Samples for Effective Image-Text Matching

Cited by: 4
Authors
Wei, Hao [1 ,2 ]
Wang, Shuhui [1 ,3 ]
Han, Xinzhe [1 ,2 ]
Xue, Zhe [4 ]
Ma, Bin [5 ]
Wei, Xiaoming [5 ]
Wei, Xiaolin [5 ]
Affiliations
[1] Chinese Acad Sci, Inst Comput Tech, Key Lab Intell Info Proc, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
[4] BUPT, Beijing Key Lab Intelligent Telecommun Software &, Beijing, Peoples R China
[5] Meituan Inc, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
Image-Text Matching; Hard Negative Mining; Causal Effects; Counterfactual Reasoning; Similarity
DOI
10.1145/3503161.3547814
Chinese Library Classification
TP39 [Computer Applications]
Discipline codes
081203; 0835
Abstract
Image-text matching is a fundamental research topic bridging vision and language. Recent works use hard negative mining to capture the multiple correspondences between the visual and textual domains. Unfortunately, truly informative negative samples are sparse in the training data and hard to obtain from a randomly sampled mini-batch. Motivated by causal inference, we aim to overcome this shortcoming by carefully analyzing the analogy between hard negative mining and optimizing causal effects. We then propose the Counterfactual Matching (CFM) framework for more effective image-text correspondence mining. CFM contains three major components: Gradient-Guided Feature Selection for automatic causal factor identification, Self-Exploration for causal factor completeness, and Self-Adjustment for counterfactual sample synthesis. Compared with traditional hard negative mining, our method largely alleviates over-fitting and effectively captures the fine-grained correlations between the image and text modalities. We evaluate CFM in combination with three state-of-the-art image-text matching architectures. Quantitative and qualitative experiments on two publicly available datasets demonstrate its strong generality and effectiveness. Code is available at https://github.com/weihao20/cfm.
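For context, the in-batch hard negative mining that the abstract contrasts against can be sketched as a max-violating triplet loss over a batch similarity matrix. This is a minimal illustrative sketch (the function name and NumPy formulation are assumptions, not the authors' CFM implementation): diagonal entries hold matched image-text pairs, and the hardest in-batch negative is mined per row and per column.

```python
import numpy as np

def hardest_negative_triplet_loss(sim, margin=0.2):
    """Max-violating triplet loss over a batch similarity matrix.

    sim[i, j] is the similarity between image i and text j; the
    diagonal holds the matched (positive) pairs. For each image and
    each text, the hardest negative is mined within the mini-batch,
    which is exactly why informative negatives are scarce when the
    batch is randomly sampled.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                   # scores of matched pairs
    off = sim - np.eye(n) * 1e9          # mask positives off the search
    hard_i2t = off.max(axis=1)           # hardest text for each image
    hard_t2i = off.max(axis=0)           # hardest image for each text
    loss = np.maximum(0.0, margin - pos + hard_i2t) \
         + np.maximum(0.0, margin - pos + hard_t2i)
    return loss.mean()
```

The loss is zero whenever every positive pair beats its hardest in-batch negative by the margin; CFM's counterfactual synthesis is motivated by the fact that such easily-satisfied batches dominate random sampling.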
Pages: 4355-4364
Page count: 10