Global-Attention-Based Neural Networks for Vision Language Intelligence

Cited by: 14
Authors
Liu, Pei [1]
Zhou, Yingjie [1]
Peng, Dezhong [1,2,3]
Wu, Dapeng [4]
Affiliations
[1] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
[2] Sichuan Zhiqian Technol Co Ltd, Chengdu 610041, Peoples R China
[3] Shenzhen Peng Cheng Lab, Shenzhen 518052, Peoples R China
[4] Univ Florida, Dept Elect & Comp Engn, Gainesville, FL 32611 USA
Funding
National Natural Science Foundation of China
Keywords
Global attention; image captioning; latent contribution;
DOI
10.1109/JAS.2020.1003402
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
In this paper, we develop a novel global-attention-based neural network (GANN) for vision language intelligence, specifically image captioning (generating a language description of a given image). As in many previous works, our model adopts the encoder-decoder framework: the encoder encodes the region-proposal features and extracts a global caption feature through a specially designed module that predicts the caption objects, while the decoder generates captions by feeding the obtained global caption feature, together with the encoded visual features, into each attention head of the decoder layer. The global caption feature is introduced to explore the latent contributions of region proposals to image captioning and to help the decoder focus on the most relevant proposals, so that more accurate visual features are extracted at each time step of caption generation. GANN is implemented by incorporating the global caption feature into the attention-weight calculation of the word prediction process in each head of the decoder layer. In our experiments, we qualitatively analyze the proposed model and quantitatively compare GANN with several state-of-the-art schemes on the MS-COCO dataset. Experimental results demonstrate the effectiveness of the proposed global attention mechanism for image captioning.
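To make the mechanism described in the abstract concrete, below is a minimal sketch (in PyTorch) of how a global caption feature could be folded into the attention-weight computation of a single decoder attention head over region-proposal features. The class and argument names (GlobalAttentionHead, region_feats, global_feat) are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttentionHead(nn.Module):
    # Illustrative sketch only: one attention head whose weights over the
    # region proposals are biased by a global caption feature.
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head)  # query from the decoder state
        self.k_proj = nn.Linear(d_model, d_head)  # keys from region-proposal features
        self.v_proj = nn.Linear(d_model, d_head)  # values from region-proposal features
        self.g_proj = nn.Linear(d_model, d_head)  # projection of the global caption feature
        self.scale = d_head ** -0.5

    def forward(self, hidden, region_feats, global_feat):
        # hidden:       (batch, d_model)             decoder state at the current time step
        # region_feats: (batch, n_regions, d_model)  encoded region proposals
        # global_feat:  (batch, d_model)             global caption feature from the encoder
        q = self.q_proj(hidden).unsqueeze(1)       # (batch, 1, d_head)
        k = self.k_proj(region_feats)              # (batch, n_regions, d_head)
        v = self.v_proj(region_feats)              # (batch, n_regions, d_head)
        g = self.g_proj(global_feat).unsqueeze(1)  # (batch, 1, d_head)

        # Standard scaled dot-product scores plus a global term that biases the
        # weights toward proposals relevant to the caption as a whole.
        local_scores = (q * k).sum(-1) * self.scale   # (batch, n_regions)
        global_scores = (g * k).sum(-1) * self.scale  # (batch, n_regions)
        weights = F.softmax(local_scores + global_scores, dim=-1)

        context = torch.bmm(weights.unsqueeze(1), v).squeeze(1)  # (batch, d_head)
        return context, weights

# Example usage with arbitrary sizes: 2 images, 36 region proposals, 512-d features.
head = GlobalAttentionHead(d_model=512, d_head=64)
ctx, attn = head(torch.randn(2, 512), torch.randn(2, 36, 512), torch.randn(2, 512))

Adding the global term to the scores, rather than concatenating features, keeps the softmax over the same set of proposals while letting the global caption feature re-rank them; the exact way the paper combines the two signals may differ from this sketch.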
Pages: 1243-1252
Number of pages: 10
Related Papers
50 items in total
  • [21] Attention-Enhanced Graph Neural Networks With Global Context for Session-Based Recommendation
    Chen, Yingpei
    Tang, Yan
    Yuan, Yuan
    IEEE ACCESS, 2023, 11 : 26237 - 26246
  • [22] A cocrystal prediction method of graph neural networks based on molecular spatial information and global attention
    Kang, Yanlei
    Chen, Jiahui
    Hu, Xiurong
    Jiang, Yunliang
    Li, Zhong
    CRYSTENGCOMM, 2023, 25 (46) : 6405 - 6415
  • [23] Graph neural networks in vision-language image understanding: a survey
    Senior, Henry
    Slabaugh, Gregory
    Yuan, Shanxin
    Rossi, Luca
VISUAL COMPUTER, 2025, 41 (01) : 491 - 516
  • [25] Imaging studies of vision, attention and language
    Neville, HJ
    PROCEEDINGS OF THE EIGHTEENTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY, 1996, : 5 - 6
  • [26] Amharic Language Image Captions Generation Using Hybridized Attention-Based Deep Neural Networks
    Solomon, Rodas
    Abebe, Mesfin
    APPLIED COMPUTATIONAL INTELLIGENCE AND SOFT COMPUTING, 2023, 2023
  • [27] Language recognition based on fuzzy neural networks
Wuhan Jiaotong Keji Daxue Xuebao, 1 (39-41)
  • [28] A Novel Attention-based Aggregation Function to Combine Vision and Language
    Stefanini, Matteo
    Cornia, Marcella
    Baraldi, Lorenzo
    Cucchiara, Rita
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 1212 - 1219
  • [30] Vision safety system based on cellular neural networks
    Grabowski, A.
    Kosinski, R. A.
    Dzwiarek, M.
    MACHINE VISION AND APPLICATIONS, 2011, 22 (03) : 581 - 590