Precise and Faster Image Description Generation with Limited Resources Using an Improved Hybrid Deep Model

Cited by: 2
Authors
Patra, Biswajit [1 ]
Kisku, Dakshina Ranjan [1 ]
Affiliations
[1] Natl Inst Technol, Dept Comp Sci & Engn, Durgapur 713209, India
Source
PATTERN RECOGNITION AND MACHINE INTELLIGENCE, PREMI 2023 | 2023, Vol. 14301
Keywords
Image captioning; Hybrid pre-trained CNN model; Inception-ResNet-v2; Attention; GRU; Compact vocabulary; Evaluation metric
DOI
10.1007/978-3-031-45170-6_18
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
We propose a model that performs image captioning efficiently based on entity relations, built on a deep learning encoder-decoder architecture. To make the captions more precise, the proposed model uses Inception-ResNet-v2 as the encoder and a GRU as the decoder. To keep the captioning model inexpensive and effective, and to accelerate training by mitigating the vanishing-gradient problem, residual connections are introduced into the Inception architecture. The effectiveness of the proposed model is further enhanced by coupling the Bahdanau attention model with the GRU. To reduce computation time and resource consumption, a compact vocabulary of informative words is used. The proposed work uses the convolution base of the hybrid model to learn alignments from scratch and to capture correlations between images and their descriptions. The proposed image-to-text generation model is evaluated on the Flickr8k, Flickr30k, and MSCOCO datasets, where it produces convincing results.
Pages: 166-175 (10 pages)
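
As a rough illustration of the architecture the abstract describes, the following is a minimal sketch in Python, assuming TensorFlow/Keras: a frozen Inception-ResNet-v2 convolution base as the encoder and a GRU decoder with Bahdanau attention. All layer sizes, class names, and the framework choice are illustrative assumptions, not the authors' published configuration.

# Hypothetical sketch of the encoder-decoder described in the abstract;
# sizes and names are assumptions, not the paper's exact settings.
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive (Bahdanau) attention over encoder feature locations."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects encoder features
        self.W2 = tf.keras.layers.Dense(units)  # projects decoder state
        self.V = tf.keras.layers.Dense(1)       # scores each location

    def call(self, features, hidden):
        # features: (batch, 64, feat_dim) -- flattened 8x8 spatial grid
        # hidden:   (batch, units)        -- previous GRU state
        score = self.V(tf.nn.tanh(
            self.W1(features) + self.W2(tf.expand_dims(hidden, 1))))
        weights = tf.nn.softmax(score, axis=1)               # (batch, 64, 1)
        context = tf.reduce_sum(weights * features, axis=1)  # (batch, feat_dim)
        return context, weights

class GRUDecoder(tf.keras.Model):
    """One-token-at-a-time GRU decoder conditioned on attended features."""
    def __init__(self, embed_dim, units, vocab_size):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True,
                                       return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, word_ids, features, hidden):
        # word_ids: (batch, 1) -- the previously generated token
        context, weights = self.attention(features, hidden)
        x = self.embedding(word_ids)                          # (batch, 1, embed_dim)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x, initial_state=hidden)
        logits = self.fc(output[:, -1])                       # (batch, vocab_size)
        return logits, state, weights

# Frozen Inception-ResNet-v2 convolution base as the image encoder;
# for a 299x299 input it yields an 8x8x1536 map, reshaped to (64, 1536).
encoder = tf.keras.applications.InceptionResNetV2(
    include_top=False, weights="imagenet")

At inference time the 8x8x1536 encoder output would be reshaped to (batch, 64, 1536) and the decoder invoked one token per step, feeding each predicted word and GRU state back in until an end-of-sequence token is produced.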