Imageability- and Length-Controllable Image Captioning

Cited by: 6
Authors
Kastner, Marc A. [1 ]
Umemura, Kazuki [2 ]
Ide, Ichiro [2 ,3 ]
Kawanishi, Yasutomo [2 ,4 ]
Hirayama, Takatsugu [2 ,5 ]
Doman, Keisuke [6 ]
Deguchi, Daisuke [2 ]
Murase, Hiroshi [2 ]
Satoh, Shin'Ichi [1 ]
Affiliations
[1] Natl Inst Informat, Digital Content & Media Sci Res Div, Chiyoda Ku, Tokyo 1018430, Japan
[2] Nagoya Univ, Grad Sch Informat, Chikusa Ku, Nagoya, Aichi 4648601, Japan
[3] Nagoya Univ, Math & Data Sci Ctr, Chikusa Ku, Nagoya, Aichi 4648601, Japan
[4] RIKEN, Guardian Robot Project, Informat Res & Dev & Strategy Headquarters, Seika, Kyoto 6190288, Japan
[5] Univ Human Environm, Fac Human Environm, Okazaki, Aichi 4443505, Japan
[6] Chukyo Univ, Sch Engn, Toyota, Aichi 4700393, Japan
Source
IEEE ACCESS | 2021 / Vol. 9 / Issue 09
Keywords
Visualization; Transformers; Task analysis; Sports; Licenses; Informatics; Training; Machine learning; Semantics; Image captioning; Psycholinguistics; Database; Ratings
DOI
10.1109/ACCESS.2021.3131393
Chinese Library Classification (CLC)
TP [Automation technology; computer technology]
Discipline Code
0812
Abstract
Image captioning methods can achieve strong performance when generating captions for general purposes, but it remains difficult to adjust the generated captions to different applications. In this paper, we propose an image captioning method that can generate both imageability- and length-controllable captions. The imageability parameter adjusts the level of visual descriptiveness of the caption, making it either more abstract or more concrete. In contrast, the length parameter only adjusts the length of the caption while keeping the visual descriptiveness at a similar level. Based on a transformer architecture, our model is trained on an augmented dataset with captions diversified across different degrees of descriptiveness. The resulting model can control both imageability and length, making it possible to tailor the output towards various applications. Experiments show that we maintain captioning performance similar to that of comparison methods, while being able to control the visual descriptiveness and the length of the generated captions. A subjective evaluation with human participants also shows a significant correlation between the target imageability and human expectations. Thus, we confirm that the proposed method is a promising step towards tailoring image captions more closely to specific applications.
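The abstract describes conditioning a transformer-based caption decoder on two control signals, an imageability level and a target length. The following is a minimal, hypothetical PyTorch sketch of one way such control could be wired in, by prepending learned control embeddings to the decoder input. The module names, bucket counts, and feature dimensions are illustrative assumptions and do not reflect the authors' actual implementation.

import torch
import torch.nn as nn

class ControllableCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, feat_dim=2048,
                 n_imageability_levels=5, n_length_levels=4, n_heads=8, n_layers=3):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # One learned embedding per discretized control level (assumed bucket counts).
        self.imageability_emb = nn.Embedding(n_imageability_levels, d_model)
        self.length_emb = nn.Embedding(n_length_levels, d_model)
        self.visual_proj = nn.Linear(feat_dim, d_model)  # project image region features
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, caption_tokens, imageability_level, length_level):
        # visual_feats: (B, R, feat_dim) region features; caption_tokens: (B, T) word ids
        memory = self.visual_proj(visual_feats)                     # (B, R, d_model)
        ctrl = torch.stack([self.imageability_emb(imageability_level),
                            self.length_emb(length_level)], dim=1)  # (B, 2, d_model)
        tgt = torch.cat([ctrl, self.token_emb(caption_tokens)], dim=1)
        # Causal mask so each position attends only to earlier positions.
        sz = tgt.size(1)
        mask = torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden[:, 2:])  # drop the two control positions

# Toy forward pass with random inputs.
model = ControllableCaptioner()
logits = model(torch.randn(2, 36, 2048), torch.randint(0, 10000, (2, 12)),
               torch.tensor([4, 1]), torch.tensor([2, 0]))
print(logits.shape)  # torch.Size([2, 12, 10000])

At inference time, varying only the imageability or length index while holding the image fixed would steer the generated caption, which mirrors the control behavior the abstract describes.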
Pages: 162951-162961
Number of pages: 11