Unpaired Image Captioning by Image-Level Weakly-Supervised Visual Concept Recognition

Cited by: 3
Authors
Zhu, Peipei [1,2]
Wang, Xiao [2 ]
Luo, Yong [3 ]
Sun, Zhenglong [4 ]
Zheng, Wei-Shi [2,5]
Wang, Yaowei [2 ]
Chen, Changwen [6 ]
Affiliations
[1] Chinese Univ Hong Kong, Sch Sci & Engn, Shenzhen 518172, Peoples R China
[2] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[3] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Peoples R China
[4] Chinese Univ Hong Kong, Sch Sci & Engn, Shenzhen 518172, Peoples R China
[5] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou 510275, Peoples R China
[6] Hong Kong Polytech Univ, Dept Comp, Hong Kong 999077, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Image recognition; Task analysis; Object detection; Training; Annotations; Data models; Graph neural network; unpaired image captioning; weakly-supervised instance segmentation;
DOI
10.1109/TMM.2022.3214090
CLC number (Chinese Library Classification)
TP [Automation Technology, Computer Technology];
Discipline code
0812;
Abstract
The goal of unpaired image captioning (UIC) is to describe images without using image-caption pairs in the training phase. Although challenging, we expect that the task can be accomplished by leveraging images aligned with visual concepts. Most existing studies use off-the-shelf algorithms to obtain the visual concepts, because the bounding-box (BBox) labels or relationship-triplet labels required for training are expensive to acquire. To avoid such exhaustive annotation, we propose a novel approach to cost-effective UIC. Specifically, we adopt image-level labels to optimize the UIC model in a weakly-supervised manner. For each image, we assume that only the image-level labels are available, without specific locations or counts. The image-level labels are used to train a weakly-supervised object recognition model that extracts object information (e.g., instances), and the extracted instances are then used to infer the relationships among different objects with an enhanced graph neural network (GNN). The proposed approach achieves comparable or even better performance than previous methods, without requiring expensive annotations. Furthermore, we design an unrecognized-object (UnO) loss to improve the alignment between the inferred object and relationship information and the images. It effectively alleviates the issue of existing UIC models generating sentences that mention nonexistent objects. To the best of our knowledge, this is the first attempt to address the problem of weakly-supervised visual concept recognition for UIC (WS-UIC) based only on image-level labels. Extensive experiments demonstrate that the proposed method achieves inspiring results on the COCO dataset while significantly reducing the labeling cost.
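
As a hedged illustration of the unrecognized-object (UnO) loss described in the abstract, the minimal Python sketch below penalizes object words in a generated caption that the weakly-supervised recognizer did not detect in the image. All names here (uno_penalty, recognized, object_vocab) are hypothetical; the paper's actual loss may be defined over model probabilities rather than discrete tokens.

# A minimal sketch (not the authors' code) of the UnO idea: discourage the
# captioner from emitting object words that were not recognized in the image.
# All names are illustrative assumptions, not taken from the paper.
def uno_penalty(caption_tokens, recognized, object_vocab, weight=1.0):
    """Count object words in the caption that were not recognized in the image.

    caption_tokens: list of words in the generated caption
    recognized:     set of visual concepts predicted from image-level labels
    object_vocab:   set of all words treated as object concepts
    """
    hallucinated = [w for w in caption_tokens
                    if w in object_vocab and w not in recognized]
    return weight * len(hallucinated)

# Example: "dog" is generated, but only {"person", "frisbee"} were recognized,
# so one nonexistent object is penalized.
tokens = "a dog catches a frisbee".split()
print(uno_penalty(tokens, {"person", "frisbee"}, {"dog", "frisbee", "person"}))  # -> 1.0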
Pages: 6702-6716
Number of pages: 15