Multilabel Image Classification With Regional Latent Semantic Dependencies

Cited by: 88
Authors
Zhang, Junjie [1 ,2 ]
Wu, Qi [3 ,4 ]
Shen, Chunhua [3 ,4 ]
Zhang, Jian [2 ]
Lu, Jianfeng [1 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Jiangsu, Peoples R China
[2] Univ Technol Sydney, Fac Engn & Informat Technol, Sydney, NSW 2007, Australia
[3] Univ Adelaide, Australia Ctr Robot Vis, Adelaide, SA 5005, Australia
[4] Univ Adelaide, Sch Comp Sci, Adelaide, SA 5005, Australia
Keywords
Multilabel image classification; semantic dependence; deep neural network; ANNOTATION; GRADIENTS;
DOI
10.1109/TMM.2018.2812605
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Deep convolutional neural networks (CNNs) have demonstrated strong performance on single-label image classification, and considerable progress has also been made in applying CNN methods to multilabel image classification, which requires annotating objects, attributes, scene categories, etc., in a single shot. Recent state-of-the-art approaches to multilabel image classification exploit label dependencies in an image at the global level, greatly improving labeling capacity. However, predicting small objects and visual concepts remains challenging because of the limited discrimination of global visual features. In this paper, we propose a regional latent semantic dependencies model (RLSD) to address this problem. The proposed model includes a fully convolutional localization architecture that localizes regions likely to contain multiple highly dependent labels. The localized regions are then fed into recurrent neural networks to characterize the latent semantic dependencies at the regional level. Experimental results on several benchmark datasets show that our model achieves the best performance compared with state-of-the-art models, especially for predicting small objects occurring in the images. We also set up an upper-bound model (RLSD+ft-RPN) that uses bounding-box coordinates during training, and the results show that our RLSD approaches this upper bound without using bounding-box annotations, which is more realistic in real-world settings.
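The abstract describes a two-stage architecture: a fully convolutional localization component proposes regions that may contain several co-occurring labels, and a recurrent network models label dependencies within each region before region-level predictions are aggregated into image-level label scores. The Python/PyTorch code below is only a minimal sketch of that idea; the module names (RLSDSketch), the soft-attention region pooling, the fixed number of proposals and unrolling steps, and the max-pooling aggregation are all illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class RLSDSketch(nn.Module):
    # Illustrative sketch of a regional latent semantic dependency model.
    # NOTE: all sizes and the aggregation rule are assumptions for illustration.
    def __init__(self, num_labels, num_regions=10, feat_dim=512, hidden_dim=512, steps=5):
        super().__init__()
        self.num_regions = num_regions
        self.steps = steps
        # Shared convolutional trunk (stand-in for a pretrained CNN backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(7),
        )
        # Fully convolutional localization head: one soft spatial mask per region.
        self.localizer = nn.Conv2d(feat_dim, num_regions, kernel_size=1)
        # RNN that models label dependencies within each localized region.
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, images):
        b = images.size(0)
        fmap = self.backbone(images)                         # (B, C, 7, 7)
        attn = self.localizer(fmap).flatten(2).softmax(-1)   # (B, R, 49) soft region masks
        feats = fmap.flatten(2)                               # (B, C, 49)
        regions = torch.bmm(attn, feats.transpose(1, 2))      # (B, R, C) region features
        # Unroll the RNN for a few steps per region so the hidden state can
        # capture co-occurring (dependent) labels within that region.
        inp = regions.reshape(b * self.num_regions, 1, -1).repeat(1, self.steps, 1)
        out, _ = self.rnn(inp)
        logits = self.classifier(out)                          # (B*R, steps, L)
        logits = logits.reshape(b, self.num_regions, self.steps, -1)
        # Max-pool over regions and steps to obtain image-level label scores.
        return logits.amax(dim=(1, 2))

model = RLSDSketch(num_labels=80)
scores = torch.sigmoid(model(torch.randn(2, 3, 224, 224)))    # (2, 80) label probabilities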
Pages: 2801-2813
Number of pages: 13