Synthesizing Counterfactual Samples for Effective Image-Text Matching

Cited by: 4
Authors
Wei, Hao [1 ,2 ]
Wang, Shuhui [1 ,3 ]
Han, Xinzhe [1 ,2 ]
Xue, Zhe [4 ]
Ma, Bin [5 ]
Wei, Xiaoming [5 ]
Wei, Xiaolin [5 ]
Affiliations
[1] Chinese Acad Sci, Inst Comput Tech, Key Lab Intell Info Proc, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
[4] BUPT, Beijing Key Lab Intelligent Telecommun Software &, Beijing, Peoples R China
[5] Meituan Inc, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
Image-Text Matching; Hard Negative Mining; Causal Effects; Counterfactual Reasoning; Similarity
DOI
10.1145/3503161.3547814
Chinese Library Classification
TP39 [Computer Applications]
Discipline codes
081203; 0835
Abstract
Image-text matching is a fundamental research topic bridging vision and language. Recent works use hard negative mining to capture the multiple correspondences between the visual and textual domains. Unfortunately, truly informative negative samples are sparse in the training data and hard to obtain from a randomly sampled mini-batch. Motivated by causal inference, we aim to overcome this shortcoming by carefully analyzing the analogy between hard negative mining and optimizing causal effects. We then propose the Counterfactual Matching (CFM) framework for more effective image-text correspondence mining. CFM contains three major components: Gradient-Guided Feature Selection for automatic causal factor identification, Self-Exploration for causal factor completeness, and Self-Adjustment for counterfactual sample synthesis. Compared with traditional hard negative mining, our method largely alleviates over-fitting and effectively captures the fine-grained correlations between the image and text modalities. We evaluate CFM in combination with three state-of-the-art image-text matching architectures. Quantitative and qualitative experiments on two publicly available datasets demonstrate its strong generality and effectiveness. Code is available at https://github.com/weihao20/cfm.
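For context, the in-batch hard negative mining that the abstract contrasts against can be sketched as a max-violating triplet loss over a batch similarity matrix. This is a minimal illustrative sketch (the function name and NumPy formulation are assumptions, not the authors' CFM implementation): diagonal entries hold matched image-text pairs, and the hardest in-batch negative is mined per row and per column.

```python
import numpy as np

def hardest_negative_triplet_loss(sim, margin=0.2):
    """Max-violating triplet loss over a batch similarity matrix.

    sim[i, j] is the similarity between image i and text j; the
    diagonal holds the matched (positive) pairs. For each image and
    each text, the hardest negative is mined within the mini-batch,
    which is exactly why informative negatives are scarce when the
    batch is randomly sampled.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                   # scores of matched pairs
    off = sim - np.eye(n) * 1e9          # mask positives off the search
    hard_i2t = off.max(axis=1)           # hardest text for each image
    hard_t2i = off.max(axis=0)           # hardest image for each text
    loss = np.maximum(0.0, margin - pos + hard_i2t) \
         + np.maximum(0.0, margin - pos + hard_t2i)
    return loss.mean()
```

The loss is zero whenever every positive pair beats its hardest in-batch negative by the margin; CFM's counterfactual synthesis is motivated by the fact that such easily-satisfied batches dominate random sampling.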
Pages: 4355-4364
Page count: 10