MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images

Cited by: 8

Authors:

Huang, Haiyan [1]

Shao, Zhenfeng [1,7]

Cheng, Qimin [2]

Huang, Xiao [3]

Wu, Xiaoping [4]

Li, Guoming [5]

Tan, Li [6]

Affiliations:

[1] Wuhan Univ, State Key Lab Informat Engn Surveying Mapping & Re, Wuhan, Peoples R China

[2] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan, Peoples R China

[3] Univ Arkansas, Dept Geosci, Fayetteville, AR USA

[4] Sichuan Normal Univ, Sch Geog & Resources Sci, Chengdu, Sichuan, Peoples R China

[5] Univ Elect Sci & Technol, Sch Resources & Environm, Chengdu, Sichuan, Peoples R China

[6] Chengdu Univ Technol, Sch Geophys, Chengdu, Sichuan, Peoples R China

[7] Wuhan Univ, State Key Lab Informat Engn Surveying Mapping & Re, Wuhan 430079, Peoples R China

Source:

INTERNATIONAL JOURNAL OF DIGITAL EARTH | 2023 / Vol. 16 / Issue 02

Funding:

National Natural Science Foundation of China;

Keywords:

Image captioning; deep learning; semantic understanding; visual-text alignment; MODELS;

DOI:

10.1080/17538947.2023.2283482

Chinese Library Classification (CLC) number:

P9 [Physical Geography];

Discipline classification code:

0705 ; 070501 ;

Abstract:

Remote Sensing Image Captioning (RSIC) plays a crucial role in advancing semantic understanding and has increasingly become a focal point of research. Nevertheless, existing RSIC methods grapple with challenges due to the intricate multi-scale nature and multifaceted backgrounds inherent in Remote Sensing Images (RSIs). Compounding these challenges are the perceptible information disparities across diverse modalities. In response to these challenges, we propose a novel multi-scale contextual information aggregation image captioning network (MC-Net). This network incorporates an image encoder enhanced with a multi-scale feature extraction module, a feature fusion module, and a finely tuned adaptive decoder equipped with a visual-text alignment module. Notably, MC-Net possesses the capability to extract informative multi-scale features, facilitated by the multilayer perceptron and transformer. We also introduce an adaptive gating mechanism during the decoding phase to ensure precise alignment between visual regions and their corresponding text descriptions. Empirical studies conducted on four publicly recognized cross-modal datasets unequivocally demonstrate the superior robustness and efficacy of MC-Net in comparison to contemporaneous RSIC methods.
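The abstract mentions an adaptive gating mechanism for aligning visual and textual information but gives no formulation. As a rough illustration only, the following sketch shows one common form of gating: a per-dimension sigmoid gate that blends a visual feature vector with a text feature vector. The function name `gated_fusion`, the weight layout, and the blending rule are all assumptions made for this example; they are not taken from the MC-Net paper and may differ from what the authors implemented.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, text, weights, bias):
    """Blend two equal-length feature vectors with a per-dimension gate.

    Hypothetical illustration, not the MC-Net formulation.
    For each dimension i:
        g_i     = sigmoid(w_v_i * visual_i + w_t_i * text_i + b_i)
        fused_i = g_i * visual_i + (1 - g_i) * text_i
    `weights` is a list of (w_v, w_t) pairs, one pair per dimension.
    Returns (fused, gates).
    """
    if not (len(visual) == len(text) == len(weights) == len(bias)):
        raise ValueError("all inputs must have the same length")

    fused, gates = [], []
    for v, t, (w_v, w_t), b in zip(visual, text, weights, bias):
        g = sigmoid(w_v * v + w_t * t + b)
        gates.append(g)
        # A gate near 1 keeps the visual feature; near 0 keeps the text feature.
        fused.append(g * v + (1.0 - g) * t)
    return fused, gates


if __name__ == "__main__":
    visual = [1.0, 3.0]
    text = [3.0, 1.0]
    # Zero weights and zero bias give a gate of exactly 0.5 in every dimension.
    fused, gates = gated_fusion(visual, text, [(0.0, 0.0)] * 2, [0.0, 0.0])
    print(fused, gates)
```

In a trained model the gate weights would be learned so that the decoder leans on visual features for content words and on language context elsewhere; here they are fixed by hand purely to make the arithmetic easy to follow.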

Citation

Pages: 4848-4866

Page count: 19

References: 45 in total
