An Ensemble of Vision-Language Transformer-Based Captioning Model With Rotatory Positional Embeddings

Cited by: 0
Authors
Sathyanarayana, K. B. [1 ,2 ]
Naik, Dinesh [1 ]
Affiliations
[1] Natl Inst Technol Karnataka, Dept Informat Technol, Surathkal 575025, Karnataka, India
[2] JNN Coll Engn, Dept Informat Sci & Engn, Shivamogga 577204, Karnataka, India
Keywords
Transformers; Feature extraction; Computational modeling; Computer architecture; Bidirectional long short term memory; Semantics; Decoding; Adaptation models; Accuracy; Data models; Attention mechanisms; bidirectional long short term memory model; convolutional neural network model; graph convolution network; image caption generation; positional embedding; rotary positional embedding; transformer;
DOI
10.1109/ACCESS.2025.3556449
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812
Abstract
Image captioning is an active research area focused on automatically generating textual descriptions of images. Traditional models, primarily employing an encoder-decoder framework with Convolutional Neural Networks (CNNs), often struggle to capture the complex spatial and sequential relationships inherent in visual data, which motivates more capable architectures. The proposed work introduces an ensemble model that integrates CNN, Graph Convolutional Network (GCN), Bidirectional Long Short-Term Memory (BiLSTM), and Transformer architectures. With the incorporation of Rotary Positional Embedding (RoPE), the approach achieves a 97% increase in CIDEr score on the Flickr30K dataset and a 28.6% improvement on the Flickr8K dataset. The GCN and BiLSTM layers capture the spatial and sequential relationships within the data, enabling the model to generate accurate and contextually rich captions for automated image-to-text applications. The proposed ensemble model with RoPE achieves strong performance on the Flickr8k and Flickr30k datasets, with scores of 80.62 and 95.0 for BLEU-1, 72.01 and 90.51 for BLEU-2, 63.12 and 81.24 for BLEU-3, 48.32 and 68.8 for BLEU-4, 74.26 and 81.89 for METEOR, 80.24 and 84.29 for ROUGE-L, 118.94 and 155.77 for CIDEr, and 48.7 and 39.0 for SPICE, respectively.
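The abstract's key ingredient is Rotary Positional Embedding (RoPE), which encodes a token's position by rotating each pair of feature dimensions by a position-dependent angle rather than adding a positional vector. The sketch below is a minimal NumPy illustration of the standard RoPE formulation (frequencies theta_i = base^(-2i/d)), not the authors' exact implementation; the function name and the `base` parameter are assumptions chosen for clarity.

```python
import numpy as np

def rotary_positional_embedding(x, base=10000.0):
    """Apply Rotary Positional Embedding (RoPE) to a sequence of vectors.

    x: array of shape (seq_len, dim) with even dim.
    Each feature pair (x[2i], x[2i+1]) is rotated by angle p * theta_i,
    where p is the token position and theta_i = base^(-2i/dim), so
    position is encoded as a rotation of the feature vector.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "feature dimension must be even"
    half = dim // 2
    # one frequency per feature pair
    freqs = base ** (-np.arange(half) * 2.0 / dim)
    # rotation angle for every (position, pair) combination
    angles = np.outer(np.arange(seq_len), freqs)      # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                   # even / odd features
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin            # 2-D rotation per pair
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated
```

Because each pair undergoes a pure rotation, vector norms are preserved and the dot product between two rotated vectors depends only on their relative positions, which is the property that makes RoPE attractive inside attention layers.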
Pages: 59841-59865 (25 pages)