GAGPT-2: A Geometric Attention-based GPT-2 Framework for Image Captioning in Hindi

Cited by: 0
Authors
Mishra, Santosh Kumar [1]
Chakraborty, Soham [2]
Saha, Sriparna [3]
Bhattacharyya, Pushpak [4]
Affiliations
[1] Rajiv Gandhi Inst Petr Technol, Dept Comp Sci & Engn, Amethi, India
[2] Kalinga Inst Ind Technol, Sch Comp Sci & Engn, Bhubaneswar, India
[3] Indian Inst Technol Patna, Dept Comp Sci & Engn, Patna, Bihar, India
[4] Indian Inst Technol, Dept Comp Sci & Engn, Bombay, Maharashtra, India
Keywords
Deep learning; attention; GPT-2; Hindi
DOI
10.1145/3622936
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Image captioning frameworks usually follow an encoder-decoder paradigm, in which the encoder receives abstract image feature vectors as input and the decoder performs language modeling. Most prominent current architectures use features from region proposals produced by object detection modules. In this work, we propose a novel architecture for image captioning that uses an object detection module integrated with a transformer architecture as the encoder and GPT-2 (Generative Pre-trained Transformer 2) as the decoder. The encoder exploits the spatial relationships among detected objects. We introduce this methodology for image caption generation in Hindi, which is widely spoken in India and elsewhere in South Asia and is the world's third most spoken language as well as India's official language. In terms of BLEU scores, the proposed approach performs at least comparably to the other baselines, and the results indicate that it outperforms them. The efficacy of the proposed approach in generating correct captions is further assessed through human evaluation of adequacy and fluency.
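The abstract does not give the attention formula, so the following is only an illustrative sketch of how spatial relationships among detected boxes could bias self-attention in an encoder. It is not the authors' implementation. The function names (`box_geometry`, `geometric_attention`), the four relative-geometry features, and the fixed `geo_weights` vector (standing in for learned parameters) are all assumptions made for this example.

```python
import math


def box_geometry(box_a, box_b):
    """Relative geometry of box_b with respect to box_a.

    Boxes are (x_min, y_min, x_max, y_max) with positive width and height.
    Returns log-scaled centre offsets and size ratios (an assumed feature set).
    """
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    aw, ah = box_a[2] - box_a[0], box_a[3] - box_a[1]
    bw, bh = box_b[2] - box_b[0], box_b[3] - box_b[1]
    eps = 1e-3  # avoids log(0) when centres coincide
    return (
        math.log(max(abs(bx - ax), eps) / aw),
        math.log(max(abs(by - ay), eps) / ah),
        math.log(bw / aw),
        math.log(bh / ah),
    )


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def geometric_attention(content_logits, boxes, geo_weights):
    """Combine content logits with a geometry bias, then normalise each row.

    geo_weights is a hypothetical 4-vector; in a trained model it would be
    learned rather than fixed by hand.
    """
    attention = []
    for i, box_i in enumerate(boxes):
        row = []
        for j, box_j in enumerate(boxes):
            features = box_geometry(box_i, box_j)
            score = sum(w * f for w, f in zip(geo_weights, features))
            # Clip to a small positive value (ReLU-like), then take the log
            # so the geometry term acts as a multiplicative attention weight.
            bias = math.log(max(score, 1e-6))
            row.append(content_logits[i][j] + bias)
        attention.append(softmax(row))
    return attention


# Three toy boxes and uniform content logits, so only geometry matters here.
boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (0, 20, 10, 30)]
content = [[0.0] * 3 for _ in range(3)]
weights = geometric_attention(content, boxes, geo_weights=(-0.5, -0.5, 0.0, 0.0))
```

With the negative weights chosen here, boxes whose centres are closer receive larger attention, which is one plausible way for an encoder to use spatial layout. The attended region features would then be passed to the language decoder.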
Pages: 16