Deconfounded Image Captioning: A Causal Retrospect

Cited by: 62

Authors:

Yang, Xu [1]

Zhang, Hanwang [2]

Cai, Jianfei [3]

Affiliations:

[1] Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China

[2] Nanyang Technol Univ, Singapore 639798, Singapore

[3] Monash Univ, Fac IT, Data Sci & AI Dept, Clayton, Vic 3800, Australia

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2023, Vol. 45, No. 11

Funding:

U.S. National Science Foundation;

Keywords:

Training; Correlation; Toy manufacturing industry; Visualization; Task analysis; Shape; Magnetic heads; Image captioning; causality; deconfounding; the backdoor adjustment; the front-door adjustment; DROPOUT;

DOI:

10.1109/TPAMI.2021.3121705

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Dataset bias in vision-language tasks is becoming one of the main problems which hinders the progress of our community. Existing solutions lack a principled analysis about why modern image captioners easily collapse into dataset bias. In this paper, we present a novel perspective: Deconfounded Image Captioning (DIC), to find out the answer of this question, then retrospect modern neural image captioners, and finally propose a DIC framework: DICv1.0 to alleviate the negative effects brought by dataset bias. DIC is based on causal inference, whose two principles: the backdoor and front-door adjustments, help us review previous studies and design new effective models. In particular, we showcase that DICv1.0 can strengthen two prevailing captioning models and can achieve a single-model 131.1 CIDEr-D and 128.4 c40 CIDEr-D on Karpathy split and online split of the challenging MS COCO dataset, respectively. Interestingly, DICv1.0 is a natural derivation from our causal retrospect, which opens promising directions for image captioning.
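For readers unfamiliar with the two adjustments named in the abstract, their textbook forms from causal inference are sketched below. The symbols are assumptions chosen for illustration and do not come from this record: X is the input (for example, the image), Y the output (the caption), Z an observed confounder that satisfies the backdoor criterion, and M a mediator that satisfies the front-door criterion. The paper's own formulation may differ.

```latex
% Requires \usepackage{amsmath}.
% Standard adjustment formulas; X, Y, Z, M are illustrative symbols, not taken from the record.
\begin{align*}
  % Backdoor adjustment: stratify on the confounder Z and average over its prior.
  P\bigl(Y \mid do(X)\bigr) &= \sum_{z} P(Y \mid X, z)\, P(z) \\
  % Front-door adjustment: route the effect through the mediator M,
  % averaging over all input values x' to block the confounded path.
  P\bigl(Y \mid do(X)\bigr) &= \sum_{m} P(m \mid X) \sum_{x'} P(Y \mid m, x')\, P(x')
\end{align*}
```

The backdoor form needs the confounder to be observed, whereas the front-door form applies even when the confounder is unobserved, provided a suitable mediator is available.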


Pages: 12996-13010

Page count: 15
