Improved image captioning with subword units training and transformer

Cited by: 0
Authors
Cai Q. [1,2,3]
Li J. [1,2,3]
Li H. [1,2,3]
Zuo M. [1,2,3]
Affiliations
[1] School of Computer and Information Engineering, Beijing Technology and Business University, Beijing
[2] Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing
[3] National Engineering Laboratory for Agri-Product Quality Traceability, Beijing
Funding
National Natural Science Foundation of China;
Keywords
Byte pair encoding (BPE); Image captioning; Reinforcement learning; Transformer;
DOI
10.3772/j.issn.1006-6748.2020.02.011
Abstract
Image captioning models typically operate with a fixed vocabulary, yet captioning is inherently an open-vocabulary problem. Existing work handles out-of-vocabulary words by mapping them to an unknown token in the dictionary. In addition, the recurrent neural network (RNN) and its variants used as caption decoders have become a bottleneck in both generation quality and training time. To address these two problems, a simpler but more effective approach is proposed: captions are generated over an open vocabulary of subword units obtained with byte pair encoding (BPE), and the long short-term memory (LSTM) decoder is replaced with a transformer for better caption quality and shorter training time. The effectiveness of different word-segmentation vocabularies and the improvement of the transformer over the LSTM are analyzed, and the improved models are shown to achieve state-of-the-art performance on the MSCOCO2014 image captioning task compared with a back-off dictionary baseline model. Copyright © by HIGH TECHNOLOGY LETTERS PRESS.
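The subword idea behind the abstract can be illustrated with a minimal BPE sketch in Python, following the general merge procedure of Sennrich et al. [6]: the most frequent adjacent symbol pairs are merged iteratively, so rare or unseen caption words decompose into known subword units rather than collapsing to an unknown token. The toy corpus, merge count, and function names below are illustrative assumptions, not the authors' implementation.

from collections import Counter

def learn_bpe(word_freqs, num_merges):
    # Learn BPE merge rules from a word-frequency dictionary.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count frequencies of adjacent symbol pairs across the vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the most frequent pair into a single symbol everywhere.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = new_vocab.get(tuple(merged), 0) + freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    # Segment a (possibly unseen) word with the learned merge rules.
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Hypothetical toy corpus: an out-of-vocabulary caption word reuses
# learned subword units instead of becoming an <unk> token.
corpus = {"snowboard": 5, "skateboard": 7, "surfboard": 4, "board": 20}
rules = learn_bpe(corpus, num_merges=10)
print(segment("kiteboard", rules))  # -> ['k', 'i', 't', 'e', 'board</w>']

In the captioning pipeline described by the abstract, such subword tokens would form the decoder's output vocabulary, so the fixed-vocabulary limitation of a word-level dictionary with a back-off unknown token is avoided.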
Pages: 211-216
Number of pages: 5
References
23 records in total
[1]  
Anderson P, He X, Buehler C, Et al., Bottom-up and top-down attention for image captioning and VQA, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5676-5685, (2018)
[2]  
Lu J, Xiong C, Parikh D, Et al., Knowing when to look: adaptive attention via a visual sentinel for image captioning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077-6086, (2018)
[3]  
Yang Z, Yuan Y, Wu Y, Et al., Review networks for caption generation, Advances in Neural Information Processing Systems, pp. 2361-2369, (2016)
[4]  
Xu K, Ba J, Kiros R, Et al., Show, attend and tell: neural image caption generation with visual attention, (2015)
[5]  
Lin T Y, Maire M, Belongie S, Et al., Microsoft COCO: common objects in context, European Conference on Computer Vision, pp. 740-755, (2014)
[6]  
Sennrich R, Haddow B, Birch A., Neural machine translation of rare words with subword units, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1715-1725, (2016)
[7]  
Vaswani A, Shazeer N, Parmar N, Et al., Attention is all you need, Advances in Neural Information Processing Systems, pp. 5998-6008, (2017)
[8]  
Devlin J, Chang M W, Lee K, Et al., Bert: pre-training of deep bidirectional transformers for language understanding, (2018)
[9]  
Young T, Hazarika D, Poria S, Et al., Recent trends in deep learning based natural language processing[J], IEEE Computational Intelligence Magazine, 13, 3, pp. 55-75, (2018)