LAM: Remote Sensing Image Captioning with Label-Attention Mechanism

Cited by: 48
Authors
Zhang, Zhengyuan [1 ,2 ,3 ]
Diao, Wenhui [1 ,2 ]
Zhang, Wenkai [1 ,2 ]
Yan, Menglong [1 ,2 ]
Gao, Xin [1 ,2 ]
Sun, Xian [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Elect, Beijing 100190, Peoples R China
[2] Chinese Acad Sci, Inst Elect, Key Lab Network Informat Syst Technol NIST, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
remote sensing image captioning; remote sensing image; image understanding; semantic understanding; MODELS;
DOI
10.3390/rs11202349
CLC classification
X [Environmental Science, Safety Science];
Discipline codes
08; 0830;
Abstract
Encoder-decoder frameworks have brought significant progress to remote sensing image captioning. The conventional attention mechanism is prevalent in this task but has a notable drawback: it computes attention masks from visual information alone, without using label information to guide the calculation. To this end, a novel attention mechanism, the Label-Attention Mechanism (LAM), is proposed in this paper. LAM additionally exploits the label information of high-resolution remote sensing images to generate natural sentences describing the given images. Notably, the word embedding vectors of the predicted categories, rather than high-level image features, are adopted to guide the calculation of attention masks. Representing image content as word embedding vectors filters out redundant image features while preserving the pure, useful information needed to generate complete sentences. Experimental results on UCM-Captions, Sydney-Captions and RSICD demonstrate that LAM improves the model's performance in describing high-resolution remote sensing images and obtains better S-m scores than other methods, where the S-m score is a hybrid metric derived from the AI Challenge 2017 scoring method. In addition, the validity of LAM is verified by an experiment using true labels.
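The abstract's core idea, using the word embeddings of predicted labels to guide the attention masks over regional image features, can be illustrated with a minimal NumPy sketch of additive (Bahdanau-style) attention. This is not the paper's implementation; the weight matrices `W_f`, `W_l`, `W_h` and the score vector `v` are illustrative placeholders for learned parameters.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def label_attention(image_feats, label_embed, hidden, W_f, W_l, W_h, v):
    """Sketch of label-guided additive attention.

    image_feats : (N, d_f) regional image features from the encoder
    label_embed : (d_e,)   word embedding of the predicted category label
    hidden      : (d_h,)   current decoder hidden state
    Returns the attention mask over the N regions and the context vector.
    """
    # The label embedding enters the score alongside the visual features,
    # so regions consistent with the predicted category score higher.
    scores = np.tanh(image_feats @ W_f + label_embed @ W_l + hidden @ W_h) @ v
    alpha = softmax(scores)        # attention mask, sums to 1 over regions
    context = alpha @ image_feats  # attention-weighted context vector
    return alpha, context
```

In a full captioning model these projections would be trained jointly with the decoder; the sketch only shows where the label embedding enters the attention computation.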
Pages: 15