Unpaired Image Captioning via Scene Graph Alignments

Cited by: 135
Authors
Gu, Jiuxiang [1]
Joty, Shafiq [1, 4]
Cai, Jianfei [1, 2]
Zhao, Handong [3]
Yang, Xu [1]
Wang, Gang [5]
Affiliations
[1] Nanyang Technological University, Singapore, Singapore
[2] Monash University, Clayton, Vic, Australia
[3] Adobe Research, San Jose, CA, USA
[4] Salesforce Research Asia, Singapore, Singapore
[5] Alibaba Group, Hangzhou, People's Republic of China
Source
2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019) | 2019
DOI
10.1109/ICCV.2019.01042
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most current image captioning models rely heavily on paired image-caption datasets. However, collecting large-scale paired image-caption data is labor-intensive and time-consuming. In this paper, we present a scene graph-based approach for unpaired image captioning. Our framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, and a sentence decoder. Specifically, we first train the scene graph encoder and the sentence decoder on the text modality. To align the scene graphs between images and sentences, we propose an unsupervised feature alignment method that maps the scene graph features from the image modality to the sentence modality. Experimental results show that our proposed model can generate promising results without using any image-caption training pairs, outperforming existing methods by a wide margin.
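The data flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `UnpairedCaptioner`, its field and method names, and the triple-based `SceneGraph` type are all hypothetical, and the sketch assumes that the same encoder is applied to both image and sentence scene graphs, which the abstract does not spell out. Each component is a plain callable so that only the wiring is shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: a scene graph is a list of (subject, relation, object) triples.
SceneGraph = List[Tuple[str, str, str]]
Features = List[float]


@dataclass
class UnpairedCaptioner:
    """Hypothetical wiring of the four components named in the abstract."""
    image_sg_generator: Callable[[object], SceneGraph]
    sentence_sg_generator: Callable[[str], SceneGraph]
    encoder: Callable[[SceneGraph], Features]
    decoder: Callable[[Features], str]
    # Maps image-side scene graph features into the sentence feature space.
    mapper: Callable[[Features], Features]

    def reconstruct(self, sentence: str) -> str:
        # Text-only path: sentence -> scene graph -> features -> sentence.
        # This is the path on which encoder and decoder are trained first.
        graph = self.sentence_sg_generator(sentence)
        return self.decoder(self.encoder(graph))

    def caption(self, image: object) -> str:
        # Image path: image -> scene graph -> features -> mapped features
        # -> sentence. No image-caption pair is needed at any point.
        graph = self.image_sg_generator(image)
        return self.decoder(self.mapper(self.encoder(graph)))
```

In this sketch the only component that touches both modalities is `mapper`, which is where the unsupervised feature alignment would be learned; how that alignment is trained is not described in the abstract and is left out here.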
Pages: 10322-10331
Number of pages: 10