Knowing What to Learn: A Metric-Oriented Focal Mechanism for Image Captioning

Cited: 35
Authors
Ji, Jiayi [1 ]
Ma, Yiwei [1 ]
Sun, Xiaoshuai [1 ,2 ]
Zhou, Yiyi [1 ]
Wu, Yongjian [3 ]
Ji, Rongrong [1 ,2 ,4 ]
Affiliations
[1] Xiamen Univ, Sch Informat, Dept Artificial Intelligence, Media Analyt & Comp Lab, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Fujian Engn Res Ctr Trusted Artificial Intelligen, Inst Artificial Intelligence, Xiamen 361005, Peoples R China
[3] Tencent, Youtu Lab, Shanghai 200233, Peoples R China
[4] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Integrated circuit modeling; Visualization; Training; Task analysis; Measurement; Transformers; Computational modeling; Image captioning; metric-oriented focal mechanism; Effective CIDEr;
DOI
10.1109/TIP.2022.3183434
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Despite considerable progress, image captioning still suffers from a large quality gap between easy and hard examples, which existing methods leave unexploited. To address this issue, we explore hard example mining in image captioning and propose a simple yet effective mechanism that instructs the model to pay more attention to hard examples, thereby improving performance in both general and complex scenarios. We first propose a novel learning strategy, termed the Metric-oriented Focal Mechanism (MFM), for hard example mining in image captioning. Unlike existing strategies designed for classification tasks, MFM adopts the generative metrics of image captioning to measure the difficulty of examples, and then up-weights the rewards of hard examples during training. To make MFM applicable to different datasets without tedious parameter tuning, we further introduce an adaptive reward metric called Effective CIDEr (ECIDEr), which considers the data distribution of easy and hard examples during reward estimation. Extensive experiments are conducted on the MS COCO benchmark, and the results show that MFM significantly improves caption quality for hard examples while maintaining performance on simple ones. Equipped with the ECIDEr-based MFM, the current SOTA method DLCT (Luo et al., 2021) outperforms all existing methods and achieves new state-of-the-art performance on both the offline and online MS COCO tests, i.e., 134.3 CIDEr offline and 136.1 CIDEr online. To validate the generalization ability of the ECIDEr-based MFM, we also apply it to another dataset, Flickr30k, where superior performance gains are likewise obtained.
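The core idea of the abstract — measuring example difficulty with a generative metric and up-weighting the self-critical rewards of hard examples in a focal-loss style — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the normalization constant `max_score`, the exponent `gamma`, and the greedy-baseline subtraction are assumptions following the usual self-critical training setup, and the CIDEr scores are taken as precomputed inputs.

```python
def mfm_weighted_rewards(sample_ciders, greedy_ciders, gamma=2.0, max_score=10.0):
    """Focal-style re-weighting of self-critical rewards (illustrative sketch).

    sample_ciders: per-example CIDEr scores of sampled captions
    greedy_ciders: per-example CIDEr scores of greedy captions (SCST baseline)
    gamma:         focal exponent; larger values focus training more on hard examples
    max_score:     assumed upper bound used to normalize CIDEr into [0, 1]
    """
    weighted = []
    for s, b in zip(sample_ciders, greedy_ciders):
        # Difficulty is high (near 1) when the metric score is low, i.e. a hard example.
        difficulty = 1.0 - min(s, max_score) / max_score
        # Focal modulation: easy examples are down-weighted, hard ones dominate.
        weight = difficulty ** gamma
        # Standard self-critical reward (sample minus greedy baseline), re-weighted.
        weighted.append(weight * (s - b))
    return weighted
```

With `gamma=2.0`, an example scoring 1.0 CIDEr against a 0.8 baseline keeps most of its reward signal, while an easy example scoring 9.0 against 8.0 is modulated down to nearly zero, so gradient updates concentrate on the hard case.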
Pages: 4321-4335
Number of Pages: 15