End-to-End Automatic Image Annotation Based on Deep CNN and Multi-Label Data Augmentation

Cited by: 77
Authors
Ke, Xiao [1, 2]
Zou, Jiawei [3]
Niu, Yuzhen [1, 2]
Affiliations
[1] Fuzhou Univ, Minist Educ, Coll Math & Comp Sci, Fujian Key Lab Network Comp & Intelligent Informa, Fuzhou 350116, Fujian, Peoples R China
[2] Fuzhou Univ, Minist Educ, Key Lab Spatial Data Min & Informat Sharing, Fuzhou 350116, Fujian, Peoples R China
[3] Fuzhou Univ, Coll Math & Comp Sci, Fujian Key Lab Network Comp & Intelligent Informa, Fuzhou 350116, Fujian, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image annotation; convolutional neural network; deep learning; generative adversarial networks; data augmentation; FUSION; NETWORKS; RANK;
DOI
10.1109/TMM.2019.2895511
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Automatic image annotation is a key step in image retrieval and image understanding. In this paper, we present an end-to-end automatic image annotation method based on a deep convolutional neural network (CNN) and multi-label data augmentation. Traditional annotation models usually perform feature extraction and annotation as two independent tasks; in contrast, we propose an end-to-end automatic image annotation model based on a deep CNN (E2E-DCNN). E2E-DCNN transforms the image annotation problem into a multi-label learning problem. It uses a deep CNN structure for adaptive feature learning and then constructs the end-to-end annotation structure, which is trained using multiple cross-entropy loss functions. It is difficult to train a deep CNN model on small-scale datasets, and traditional data augmentation methods do not scale up multi-label datasets well; hence, we propose a multi-label data augmentation method based on Wasserstein generative adversarial networks (ML-WGAN). The ML-WGAN generator can approximate the data distribution of a single multi-label image. The images generated by ML-WGAN help reduce over-fitting when training a deep CNN model and enhance the generalization ability of the trained model. We optimize the network structure by using deformable convolution and spatial pyramid pooling. We evaluate the proposed E2E-DCNN model, with data augmentation by the proposed ML-WGAN, on several public datasets. The experimental results demonstrate that the proposed model outperforms state-of-the-art automatic image annotation models.
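The abstract states that annotation is cast as multi-label learning and trained with multiple cross-entropy loss functions, but this record gives no formula. The sketch below is one common reading, offered as an assumption and not as the authors' implementation: each candidate label gets its own sigmoid output and its own binary cross-entropy term, and the terms are averaged. The function names `multi_label_cross_entropy` and `annotate`, the 0.5 threshold, and the example scores are all hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def multi_label_cross_entropy(logits, targets, eps=1e-12):
    """Average of per-label binary cross-entropy terms for one image.

    Assumption: one independent sigmoid output per label; the abstract
    does not specify the exact form of the loss.
    """
    if not logits or len(logits) != len(targets):
        raise ValueError("logits and targets must be non-empty and equal in length")
    total = 0.0
    for z, y in zip(logits, targets):
        # Clamp the probability so that log() never receives 0.
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


def annotate(logits, threshold=0.5):
    """Return the indices of labels whose predicted probability meets the threshold."""
    return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]


if __name__ == "__main__":
    # Hypothetical raw scores for three candidate labels of one image.
    scores = [2.0, -1.0, 0.5]
    truth = [1, 0, 1]
    print(multi_label_cross_entropy(scores, truth))
    print(annotate(scores))
```

Treating each label independently lets one image carry any number of tags, which is what separates multi-label annotation from single-label classification with a softmax output.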
Pages: 2093-2106
Number of pages: 14