Global-Attention-Based Neural Networks for Vision Language Intelligence

Cited by: 14
Authors
Liu, Pei [1]
Zhou, Yingjie [1]
Peng, Dezhong [1,2,3]
Wu, Dapeng [4]
Affiliations
[1] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
[2] Sichuan Zhiqian Technol Co Ltd, Chengdu 610041, Peoples R China
[3] Shenzhen Peng Cheng Lab, Shenzhen 518052, Peoples R China
[4] Univ Florida, Dept Elect & Comp Engn, Gainesville, FL 32611 USA
Funding
National Natural Science Foundation of China
Keywords
Global attention; image captioning; latent contribution;
DOI
10.1109/JAS.2020.1003402
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
In this paper, we develop a novel global-attention-based neural network (GANN) for vision language intelligence, specifically image captioning (generating a language description of a given image). As in many previous works, our model adopts the encoder-decoder framework: the encoder encodes the region-proposal features and extracts a global caption feature through a specially designed module that predicts the caption objects, while the decoder generates captions by feeding the obtained global caption feature, together with the encoded visual features, into each attention head of the decoder layer. The global caption feature is introduced to explore the latent contributions of region proposals to image captioning and to help the decoder focus on the most relevant proposals, so that more accurate visual features are extracted at each time step of caption generation. GANN is implemented by incorporating the global caption feature into the attention-weight calculation of the word prediction process in each head of the decoder layer. In our experiments, we qualitatively analyze the proposed model and quantitatively compare GANN with several state-of-the-art schemes on the MS-COCO dataset. Experimental results demonstrate the effectiveness of the proposed global attention mechanism for image captioning.
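To make the mechanism described in the abstract concrete, below is a minimal sketch (in PyTorch) of how a global caption feature could be folded into the attention-weight computation of a single decoder attention head over region-proposal features. The class and argument names (GlobalAttentionHead, region_feats, global_feat) are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttentionHead(nn.Module):
    # Illustrative sketch only: one attention head whose weights over the
    # region proposals are biased by a global caption feature.
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head)  # query from the decoder state
        self.k_proj = nn.Linear(d_model, d_head)  # keys from region-proposal features
        self.v_proj = nn.Linear(d_model, d_head)  # values from region-proposal features
        self.g_proj = nn.Linear(d_model, d_head)  # projection of the global caption feature
        self.scale = d_head ** -0.5

    def forward(self, hidden, region_feats, global_feat):
        # hidden:       (batch, d_model)             decoder state at the current time step
        # region_feats: (batch, n_regions, d_model)  encoded region proposals
        # global_feat:  (batch, d_model)             global caption feature from the encoder
        q = self.q_proj(hidden).unsqueeze(1)       # (batch, 1, d_head)
        k = self.k_proj(region_feats)              # (batch, n_regions, d_head)
        v = self.v_proj(region_feats)              # (batch, n_regions, d_head)
        g = self.g_proj(global_feat).unsqueeze(1)  # (batch, 1, d_head)

        # Standard scaled dot-product scores plus a global term that biases the
        # weights toward proposals relevant to the caption as a whole.
        local_scores = (q * k).sum(-1) * self.scale   # (batch, n_regions)
        global_scores = (g * k).sum(-1) * self.scale  # (batch, n_regions)
        weights = F.softmax(local_scores + global_scores, dim=-1)

        context = torch.bmm(weights.unsqueeze(1), v).squeeze(1)  # (batch, d_head)
        return context, weights

# Example usage with arbitrary sizes: 2 images, 36 region proposals, 512-d features.
head = GlobalAttentionHead(d_model=512, d_head=64)
ctx, attn = head(torch.randn(2, 512), torch.randn(2, 36, 512), torch.randn(2, 512))

Adding the global term to the scores, rather than concatenating features, keeps the softmax over the same set of proposals while letting the global caption feature re-rank them; the exact way the paper combines the two signals may differ from this sketch.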
Pages: 1243-1252
Number of pages: 10
Related Papers
50 items in total
  • [21] Attention-Enhanced Graph Neural Networks With Global Context for Session-Based Recommendation
    Chen, Yingpei
    Tang, Yan
    Yuan, Yuan
    IEEE ACCESS, 2023, 11 : 26237 - 26246
  • [22] A cocrystal prediction method of graph neural networks based on molecular spatial information and global attention
    Kang, Yanlei
    Chen, Jiahui
    Hu, Xiurong
    Jiang, Yunliang
    Li, Zhong
    CRYSTENGCOMM, 2023, 25 (46) : 6405 - 6415
  • [23] Graph neural networks in vision-language image understanding: a survey
    Senior, Henry
    Slabaugh, Gregory
    Yuan, Shanxin
    Rossi, Luca
VISUAL COMPUTER, 2025, 41 (01) : 491 - 516
  • [25] Imaging studies of vision, attention and language
    Neville, HJ
    PROCEEDINGS OF THE EIGHTEENTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY, 1996, : 5 - 6
  • [26] Amharic Language Image Captions Generation Using Hybridized Attention-Based Deep Neural Networks
    Solomon, Rodas
    Abebe, Mesfin
    APPLIED COMPUTATIONAL INTELLIGENCE AND SOFT COMPUTING, 2023, 2023
  • [27] Language recognition based on fuzzy neural networks
Wuhan Jiaotong Keji Daxue Xuebao, 1 (39-41)
  • [28] A Novel Attention-based Aggregation Function to Combine Vision and Language
    Stefanini, Matteo
    Cornia, Marcella
    Baraldi, Lorenzo
    Cucchiara, Rita
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 1212 - 1219
  • [30] Vision safety system based on cellular neural networks
    Grabowski, A.
    Kosinski, R. A.
    Dzwiarek, M.
    MACHINE VISION AND APPLICATIONS, 2011, 22 (03) : 581 - 590