Revisiting Counterfactual Problems in Referring Expression Comprehension

Citations: 3
Authors
Yu, Zhihan [1]
Li, Ruifan [1]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Beijing, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024
Funding
National Natural Science Foundation of China;
DOI
10.1109/CVPR52733.2024.01276
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Traditional referring expression comprehension (REC) aims to locate the target referent in an image guided by a text query. Several previous methods have studied the counterfactual problem in REC (C-REC), where the objects described by a given query cannot be found in the image. However, these methods focus only on overall image-text mismatch or on mismatch of specific attributes. In this paper, we address the C-REC problem from the perspective of fine-grained attributes. To this end, we first propose a fine-grained counterfactual sample generation method to construct C-REC datasets. Specifically, we leverage a pre-trained language model such as BERT to modify the attribute words in the queries, obtaining the corresponding counterfactual samples. Furthermore, we propose a C-REC framework. We first adopt three encoders to extract image, text, and attribute features. Then, our dual-branch attentive fusion module fuses these cross-modal features in two branches using an attention mechanism. Finally, two prediction heads generate a bounding box and a counterfactual label, respectively. In addition, we incorporate contrastive learning, with the generated counterfactual samples as negatives, to enhance counterfactual perception. Extensive experiments show that our framework achieves promising performance on both the public REC datasets RefCOCO/+/g and our constructed C-REC datasets C-RefCOCO/+/g. The code and data are available at https://github.com/Glacier0012/CREC.
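The attribute-level sample generation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says a pre-trained language model such as BERT proposes the modified attribute word, whereas this sketch substitutes a small hand-written lookup table so that it runs without any model. The names `ATTRIBUTE_VOCAB` and `make_counterfactual` are hypothetical and do not come from the paper or its repository.

```python
import random

# Hypothetical attribute vocabulary, grouped by attribute type. The abstract
# describes using a pre-trained language model such as BERT to modify the
# attribute word; this table is only a stand-in for that model.
ATTRIBUTE_VOCAB = {
    "color": ["red", "blue", "green", "white", "black"],
    "size": ["small", "large", "tall", "short"],
}


def make_counterfactual(query, rng):
    """Return (new_query, label); label is 1 if an attribute word was swapped, else 0."""
    tokens = query.split()
    # Collect the positions whose token is a known attribute word.
    candidates = [
        (i, group)
        for i, tok in enumerate(tokens)
        for group, words in ATTRIBUTE_VOCAB.items()
        if tok.lower() in words
    ]
    if not candidates:
        return query, 0  # nothing to modify, so the query is left unchanged
    i, group = rng.choice(candidates)
    # Pick a different word of the same attribute type.
    alternatives = [w for w in ATTRIBUTE_VOCAB[group] if w != tokens[i].lower()]
    tokens[i] = rng.choice(alternatives)
    return " ".join(tokens), 1


rng = random.Random(0)
original = "the man in the red shirt"
counterfactual, label = make_counterfactual(original, rng)
print(original, "->", counterfactual, "| counterfactual label:", label)
```

In this sketch the returned label plays the role of the counterfactual label mentioned in the abstract, and the edited query is the kind of sample that could serve as a negative in contrastive learning. A swapped attribute may by chance still describe some object in the image, so a real pipeline would presumably need a check against the image annotations; that step is left out here.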
Pages: 13438-13448
Page count: 11