Dynamic Convolution-based Encoder-Decoder Framework for Image Captioning in Hindi

Cited by: 1
Authors
Mishra, Santosh Kumar [1 ]
Sinha, Sushant [1 ]
Saha, Sriparna [1 ]
Bhattacharyya, Pushpak [2 ]
Affiliations
[1] Indian Institute of Technology Patna, Department of Computer Science & Engineering, Patna, Bihar, India
[2] Indian Institute of Technology Bombay, Department of Computer Science & Engineering, Bombay, Maharashtra, India
Keywords
Hindi; dynamic convolution; attention; deep learning
DOI
10.1145/3573891
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In sequence-to-sequence modeling tasks such as image captioning, machine translation, and visual question answering, encoder-decoder architectures are the state of the art. In image captioning, an encoder, typically a convolutional neural network (CNN), encodes the input image into a fixed-dimensional vector representation, while a decoder, typically a recurrent neural network, performs language modeling and generates the target description. Standard CNNs apply the same operation to every pixel, yet not all image pixels are equally important. To address this, the proposed method uses a dynamic convolution-based encoder for image encoding (feature extraction), a long short-term memory (LSTM) network as the decoder for language modeling, and X-Linear attention to make the system robust. Because encoders, attention mechanisms, and decoders are all central to image captioning, we experiment with several variants of each. Most existing work on image captioning addresses the English language; we propose a novel approach for generating captions from images in Hindi. Hindi, widely spoken in India and South Asia, is the fourth most-spoken language globally and is the official language of India. The proposed method applies the dynamic convolution operation on the encoder side to obtain higher-quality image encodings. The Hindi image captioning dataset was created by manually translating the popular MSCOCO dataset from English to Hindi. The proposed method is compared with several baselines in terms of BLEU scores, and the results show that it outperforms them. A manual human assessment of the adequacy and fluency of the generated captions further confirms that the proposed method produces good-quality captions.
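The core idea in the abstract, dynamic convolution, replaces a single static kernel with an input-conditioned mixture of several kernels, so the effective filter adapts to each image. The sketch below is a minimal PyTorch illustration of that general mechanism (attention over convolution kernels); it is not the authors' implementation, and names such as DynamicConv2d, num_kernels, reduction, and temperature are illustrative assumptions.

    # Illustrative sketch of dynamic convolution (attention over kernels);
    # not the paper's exact encoder.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicConv2d(nn.Module):
        """Mixes K convolution kernels with input-dependent attention weights."""
        def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=4,
                     reduction=4, temperature=30.0):
            super().__init__()
            self.out_ch = out_ch
            self.temperature = temperature  # softens the kernel softmax
            # K candidate kernels, aggregated per input image
            self.weight = nn.Parameter(torch.randn(
                num_kernels, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
            # Attention head: global average pool -> bottleneck MLP -> K logits
            hidden = max(in_ch // reduction, 1)
            self.attn = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(in_ch, hidden), nn.ReLU(inplace=True),
                nn.Linear(hidden, num_kernels))
            self.padding = kernel_size // 2

        def forward(self, x):
            b, c, h, w = x.shape
            # Per-image mixing weights over the K kernels: shape (B, K)
            pi = F.softmax(self.attn(x) / self.temperature, dim=1)
            # Per-image aggregated kernel: shape (B, out_ch, in_ch, k, k)
            mixed = torch.einsum('bk,koihw->boihw', pi, self.weight)
            # Grouped-conv trick: apply B different kernels in one conv2d call
            out = F.conv2d(x.reshape(1, b * c, h, w),
                           mixed.reshape(b * self.out_ch, c, *mixed.shape[-2:]),
                           padding=self.padding, groups=b)
            return out.reshape(b, self.out_ch, h, w)

    # Example: a batch of 2 feature maps, 64 -> 128 channels
    layer = DynamicConv2d(in_ch=64, out_ch=128)
    print(layer(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 128, 32, 32])

With stride 1 and same-padding, such a layer is a drop-in replacement for a static convolution inside a CNN encoder; the temperature keeps the softmax over kernels near-uniform early in training so all K kernels receive gradient.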
Pages: 18