Unpaired Image Captioning via Scene Graph Alignments

Cited by: 135
Authors
Gu, Jiuxiang [1]
Joty, Shafiq [1, 4]
Cai, Jianfei [1, 2]
Zhao, Handong [3]
Yang, Xu [1]
Wang, Gang [5]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Monash Univ, Clayton, Vic, Australia
[3] Adobe Res, San Jose, CA USA
[4] Salesforce Res Asia, Singapore, Singapore
[5] Alibaba Grp, Hangzhou, Peoples R China
Source
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019
DOI
10.1109/ICCV.2019.01042
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most current image captioning models rely heavily on paired image-caption datasets. However, collecting large-scale paired image-caption data is labor-intensive and time-consuming. In this paper, we present a scene graph-based approach for unpaired image captioning. Our framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, and a sentence decoder. Specifically, we first train the scene graph encoder and the sentence decoder on the text modality. To align the scene graphs between images and sentences, we propose an unsupervised feature alignment method that maps the scene graph features from the image modality to the sentence modality. Experimental results show that our proposed model can generate promising results without using any image-caption training pairs, outperforming existing methods by a wide margin.
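The data flow described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `SceneGraph` container, the hash-based `embed` function, the mean-pooling `encode` function, and especially `fit_alignment`, which uses simple per-dimension mean and standard-deviation matching as a stand-in for the learned unsupervised alignment. The abstract does not state the actual encoder architecture or alignment objective. The sketch only shows that the mapping can be fitted from two unpaired feature sets.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

DIM = 8  # hypothetical feature size, chosen only for this sketch


@dataclass
class SceneGraph:
    """Hypothetical container: object labels, relation triples, attribute pairs."""
    objects: List[str]  # assumed non-empty
    relations: List[Tuple[str, str, str]] = field(default_factory=list)  # (subject, predicate, object)
    attributes: List[Tuple[str, str]] = field(default_factory=list)      # (object, attribute)


def embed(token: str) -> List[float]:
    """Deterministic toy embedding: first DIM bytes of a SHA-256 digest, scaled to [0, 1]."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]


def encode(graph: SceneGraph) -> List[float]:
    """Toy scene graph encoder: mean of object, predicate, and attribute embeddings."""
    tokens = list(graph.objects)
    tokens += [predicate for _, predicate, _ in graph.relations]
    tokens += [attribute for _, attribute in graph.attributes]
    vectors = [embed(t) for t in tokens]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def _mean_std(feats: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and (population) standard deviation of a feature set."""
    n = len(feats)
    columns = list(zip(*feats))
    means = [sum(col) / n for col in columns]
    stds = []
    for col, m in zip(columns, means):
        std = (sum((x - m) ** 2 for x in col) / n) ** 0.5
        stds.append(std if std > 0 else 1.0)  # fall back to 1.0 for constant dimensions
    return means, stds


def fit_alignment(image_feats: List[List[float]],
                  sentence_feats: List[List[float]]) -> Callable[[List[float]], List[float]]:
    """Stand-in alignment (NOT the paper's method): match per-dimension statistics.

    The two feature sets are unpaired; only their overall statistics are used.
    """
    img_mean, img_std = _mean_std(image_feats)
    sen_mean, sen_std = _mean_std(sentence_feats)

    def align(vec: List[float]) -> List[float]:
        return [(x - mi) / si * ss + ms
                for x, mi, si, ms, ss in zip(vec, img_mean, img_std, sen_mean, sen_std)]

    return align


# Unpaired toy data: the image graphs and sentence graphs describe different scenes.
image_graphs = [
    SceneGraph(["dog", "grass"], [("dog", "on", "grass")], [("dog", "brown")]),
    SceneGraph(["man", "bike"], [("man", "riding", "bike")]),
]
sentence_graphs = [
    SceneGraph(["cat", "sofa"], [("cat", "sleeping on", "sofa")]),
    SceneGraph(["child", "ball"], [("child", "holding", "ball")], [("ball", "red")]),
]

image_feats = [encode(g) for g in image_graphs]
sentence_feats = [encode(g) for g in sentence_graphs]
align = fit_alignment(image_feats, sentence_feats)

# Aligned image feature; in the described framework, a vector in this space
# would be passed to a sentence decoder trained only on text.
aligned = align(image_feats[0])
```

In this sketch the decoder is omitted; the point is only that `align` moves image-side scene graph features into the feature space of the sentence side without using any image-caption pairs.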
Pages: 10322-10331
Page count: 10