Image Captioning Model Using Part-of-Speech Guidance Module for Description With Diverse Vocabulary

Cited by: 8
Authors
Bae, Ju-Won [1 ]
Lee, Soo-Hwan [1 ]
Kim, Won-Yeol [2 ]
Seong, Ju-Hyeon [3 ]
Seo, Dong-Hoan [4 ]
Affiliations
[1] Korea Maritime & Ocean Univ, Dept Elect & Elect Engn, Interdisciplinary Major Maritime AI Convergence, Busan 49112, South Korea
[2] Korea Maritime & Ocean Univ, Artificial Intelligence Convergence Res Ctr Reg I, Busan 49112, South Korea
[3] Korea Maritime & Ocean Univ, Dept Liberal Educ, Interdisciplinary Major Maritime AI Convergence, Busan 49112, South Korea
[4] Korea Maritime & Ocean Univ, Div Elect & Elect Informat Engn, Interdisciplinary Major Maritime AI Convergence, Busan 49112, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Decoding; Predictive models; Visualization; Focusing; Feature extraction; Artificial intelligence; Vocabulary; Deep learning; image captioning; multimodal layer; part of speech;
DOI
10.1109/ACCESS.2022.3169781
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Image captioning aims to generate human-like sentences that describe an image's content. Recent developments in deep learning (DL) have made it possible to caption images with accurate descriptions and detailed expressions. However, because DL learns the relationship between images and captions, it constructs sentences from the words that appear most frequently in the dataset. Although the generated sentences are highly accurate, their limited vocabulary gives them low lexical diversity compared with human descriptions. Therefore, in this paper, we propose a Part-Of-Speech (POS) guidance module and a multimodal image captioning model that controls how strongly image and word-sequence information contribute and generates sentences guided by POS to enhance lexical diversity. The proposed POS guidance module enables rich expression by controlling image and sequence information according to the predicted POS when predicting words. The POS multimodal layer then combines the POS vector with the output vector of the Bi-LSTM through a multimodal layer to predict the next word while accounting for grammatical structure. We trained and tested the proposed model on the Flickr30K and MS COCO datasets and compared it with current state-of-the-art studies. We also analyzed the lexical diversity of the captioning model through the Type-Token Ratio (TTR) and confirmed that the proposed model generates sentences with a more diverse vocabulary.
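The abstract measures lexical diversity with the Type-Token Ratio (TTR), i.e., the number of unique word types divided by the total number of word tokens. As a concrete reference, the short Python sketch below computes TTR over a set of generated captions; the toy captions and the tokenization (lowercased whitespace splitting) are illustrative assumptions rather than the paper's exact evaluation protocol.

def type_token_ratio(captions):
    """Type-Token Ratio: unique word types divided by total word tokens.

    A higher TTR indicates that the captioning model draws on a more
    diverse vocabulary. Tokenization is simple lowercased whitespace
    splitting, an assumption made for illustration.
    """
    tokens = [word.lower() for caption in captions for word in caption.split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Toy usage with two hypothetical generated captions.
generated = [
    "a man riding a surfboard on a large wave",
    "a surfer balances on his board as the wave curls over",
]
print(f"TTR: {type_token_ratio(generated):.3f}")

The abstract also states that the POS multimodal layer combines the POS information with the Bi-LSTM output before predicting the next word. The sketch below shows one way such a fusion could look in PyTorch; the class name, layer sizes, additive fusion, and tanh activation are assumptions made for illustration and may differ from the authors' implementation.

import torch
import torch.nn as nn

class POSMultimodalLayer(nn.Module):
    """Hypothetical sketch of a POS multimodal layer.

    Assumption: the predicted POS vector and the Bi-LSTM output are each
    projected into a shared multimodal space, summed, and mapped to
    vocabulary logits for the next word.
    """

    def __init__(self, hidden_dim, pos_dim, multimodal_dim, vocab_size):
        super().__init__()
        self.proj_seq = nn.Linear(2 * hidden_dim, multimodal_dim)  # Bi-LSTM concatenates both directions
        self.proj_pos = nn.Linear(pos_dim, multimodal_dim)
        self.out = nn.Linear(multimodal_dim, vocab_size)

    def forward(self, bilstm_out, pos_vec):
        # Additive fusion of sequence and POS information in the shared space.
        fused = torch.tanh(self.proj_seq(bilstm_out) + self.proj_pos(pos_vec))
        return self.out(fused)  # logits over the vocabulary for the next word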
Pages: 45219-45229
Page count: 11