Transform and Tell: Entity-Aware News Image Captioning

被引:49
作者
Tran, Alasdair [1 ]
Mathews, Alexander [1 ]
Xie, Lexing [1 ]
机构
[1] Australian Natl Univ, Canberra, ACT, Australia
来源
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020) | 2020年
基金
澳大利亚研究理事会;
关键词
D O I
10.1109/CVPR42600.2020.01305
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We propose an end-to-end model which generates captions for images embedded in news articles. News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words. We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism. We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair-encoding to generate captions as a sequence of word parts. On the Good-News dataset [3], our model outperforms the previous state of the art by a factor of four in CIDEr score (13 -> 54). This performance gain comes from a unique combination of language models, word representation, image embeddings, face embeddings, object embeddings, and improvements in neural network design. We also introduce the NYTimes800k dataset which is 70% larger than GoodNews, has higher article quality, and includes the locations of images within articles as an additional contextual cue.
引用
收藏
页码:13032 / 13042
页数:11
相关论文
共 35 条
[1]   Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering [J].
Anderson, Peter ;
He, Xiaodong ;
Buehler, Chris ;
Teney, Damien ;
Johnson, Mark ;
Gould, Stephen ;
Zhang, Lei .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :6077-6086
[2]   VGGFace2: A dataset for recognising faces across pose and age [J].
Cao, Qiong ;
Shen, Li ;
Xie, Weidi ;
Parkhi, Omkar M. ;
Zisserman, Andrew .
PROCEEDINGS 2018 13TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE & GESTURE RECOGNITION (FG 2018), 2018, :67-74
[3]  
Dauphin YN, 2017, PR MACH LEARN RES, V70
[4]  
Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
[5]  
Donahue J, 2015, PROC CVPR IEEE, P2625, DOI 10.1109/CVPR.2015.7298878
[6]  
Fan A, 2018, PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, P889
[7]  
Fang H, 2015, PROC CVPR IEEE, P1473, DOI 10.1109/CVPR.2015.7298754
[8]   Automatic Caption Generation for News Images [J].
Feng, Yansong ;
Lapata, Mirella .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2013, 35 (04) :797-812
[9]   A New Readability Yardstick [J].
Flesch, Rudolf .
JOURNAL OF APPLIED PSYCHOLOGY, 1948, 32 (03) :221-233
[10]  
Gardner M, 2018, NLP OPEN SOURCE SOFTWARE (NLP-OSS), P1