Deconfounded Image Captioning: A Causal Retrospect

Cited by: 51
Authors
Yang, Xu [1]
Zhang, Hanwang [2]
Cai, Jianfei [3]
Affiliations
[1] Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China
[2] Nanyang Technol Univ, Singapore 639798, Singapore
[3] Monash Univ, Fac IT, Data Sci & AI Dept, Clayton, Vic 3800, Australia
Funding
National Science Foundation (USA);
Keywords
Training; Correlation; Toy manufacturing industry; Visualization; Task analysis; Shape; Magnetic heads; Image captioning; causality; deconfounding; the backdoor adjustment; the front-door adjustment; DROPOUT;
DOI
10.1109/TPAMI.2021.3121705
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dataset bias in vision-language tasks is becoming one of the main problems hindering the progress of our community. Existing solutions lack a principled analysis of why modern image captioners easily collapse into dataset bias. In this paper, we present a novel perspective, Deconfounded Image Captioning (DIC), to answer this question, then revisit modern neural image captioners, and finally propose a DIC framework, DICv1.0, to alleviate the negative effects of dataset bias. DIC is based on causal inference, whose two principles, the backdoor and front-door adjustments, help us review previous studies and design new effective models. In particular, we show that DICv1.0 can strengthen two prevailing captioning models and achieve a single-model 131.1 CIDEr-D and 128.4 c40 CIDEr-D on the Karpathy split and the online split of the challenging MS COCO dataset, respectively. Interestingly, DICv1.0 is a natural derivation from our causal retrospect, which opens promising directions for image captioning.
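The backdoor adjustment named in the abstract can be illustrated on a toy discrete model. The sketch below is not the DICv1.0 model and is not taken from the paper; it only assumes a single binary confounder Z that influences both X and Y, with made-up probabilities, and compares the observational quantity P(Y=1 | X=x) with the adjusted quantity P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) P(z). All names and numbers are hypothetical.

```python
# Toy model with a confounder: Z -> X and Z -> Y, plus X -> Y.
# Every probability below is a made-up illustrative value.
p_z = {0: 0.5, 1: 0.5}                      # P(Z=z)
p_x1_given_z = {0: 0.2, 1: 0.8}             # P(X=1 | Z=z)
p_y1_given_xz = {                           # P(Y=1 | X=x, Z=z)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.3, (1, 1): 0.7,
}


def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary x."""
    return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]


def observational(x):
    """P(Y=1 | X=x): weights each stratum by P(z | x), so Z's bias leaks in."""
    p_x = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    joint = sum(p_y1_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z] for z in p_z)
    return joint / p_x


def backdoor(x):
    """P(Y=1 | do(X=x)): weights each stratum by the prior P(z) instead."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)


if __name__ == "__main__":
    for x in (0, 1):
        print(f"x={x}: observational={observational(x):.2f}, "
              f"backdoor={backdoor(x):.2f}")
```

In this toy setting the observational gap between x=1 and x=0 is larger than the adjusted gap, because Z makes X and Y co-occur more often than the direct effect of X alone would explain. That is the kind of spurious correlation the abstract attributes to dataset bias.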
Pages: 12996-13010
Page count: 15