Image Captioning Encoder–Decoder Models Using CNN-RNN Architectures: A Comparative Study

Authors
K. Revati Suresh
Arun Jarapala
P. V. Sudeep
Affiliations
[1] National Institute of Technology Calicut, Department of Electronics and Communication Engineering
Source
Circuits, Systems, and Signal Processing | 2022 / Vol. 41
Keywords
Deep learning; Image captioning; Natural language processing; Convolutional neural network; Recurrent neural network
Abstract
An image caption generator produces syntactically and semantically correct sentences that narrate the scene of a natural image. The neural image caption (NIC) generator is a popular deep learning model for automatically generating image captions in plain English; it combines a convolutional neural network (CNN) encoder with a long short-term memory (LSTM) decoder. This paper investigates the performance of different CNN encoders and recurrent neural network decoders to find the best NIC generator model for image captioning. In addition, we test the image caption generators with four image inject models and with decoding strategies such as greedy search and beam search. We conducted experiments on the Flickr8k dataset and analyzed the results qualitatively and quantitatively. Our results show that an automated image caption generator with a ResNet-101 encoder and an LSTM/gated recurrent unit (GRU) decoder outperforms the popular NIC generator when par-inject concatenate conditioning and beam search are used. For quantitative assessment, we used $ROUGE_L$, $CIDEr_D$, and $BLEU_n$ scores to compare the different models.
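The abstract compares greedy and beam-search decoding for the caption generator. As a minimal sketch (not the paper's implementation), the snippet below runs beam search over a hypothetical toy next-token model; in the actual captioner, the step function would be the LSTM/GRU decoder conditioned on CNN image features.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Keep the `beam_width` most probable partial sequences at every step.

    `step_fn(seq)` returns a dict mapping candidate next tokens to their
    probabilities given the sequence so far.
    """
    beams = [([start_token], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:          # finished captions carry over
                candidates.append((seq, score))
                continue
            for token, prob in step_fn(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        # prune to the top-k scoring candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams[0][0]

# Hypothetical next-token distributions standing in for the RNN decoder.
TOY_LM = {
    "<s>": {"a": 0.45, "the": 0.55},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def toy_step(seq):
    return TOY_LM[seq[-1]]

print(beam_search(toy_step, "<s>", "</s>", beam_width=2))
```

On this toy model, greedy search would commit to "the" (probability 0.55) at the first step and end with a caption of total probability 0.275, while a beam of width 2 recovers the higher-probability caption "a dog" (0.405), illustrating why beam search tends to improve caption quality.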
Pages: 5719–5742