Switchable Novel Object Captioner

Cited by: 25
Authors
Wu, Yu [1 ,2 ]
Jiang, Lu [3 ]
Yang, Yi [4 ]
Affiliations
[1] Baidu Res, Beijing 100000, Peoples R China
[2] Princeton Univ, Sch Comp Sci, Princeton, NJ 08540 USA
[3] Google Res, Mountain View, CA 94043 USA
[4] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310058, Zhejiang, Peoples R China
Keywords
Image captioning; novel object captioning; zero-shot learning;
DOI
10.1109/TPAMI.2022.3144984
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image captioning aims to describe images automatically in natural-language sentences and typically requires large amounts of paired image-sentence training data. As a result, trained captioning models can hardly be applied to new domains that contain novel words. In this paper, we introduce the zero-shot novel object captioning task, in which the machine generates descriptions of novel objects without any extra training sentences. To tackle this challenging task, we mimic the way babies talk about something unknown, i.e., by using the word for a similar known object. Following this motivation, we build a key-value object memory from detection models, containing the visual information and corresponding words for the objects in an image. For novel objects, we use the words of the most similar seen objects as proxy visual words to resolve the out-of-vocabulary issue. We then propose a Switchable LSTM that incorporates knowledge from the object memory into sentence generation. The model has two switchable working modes: 1) generating sentences like a standard LSTM, and 2) retrieving proper nouns from the key-value memory. Our model thus learns to fully disentangle language generation from the training objects and requires zero training sentences when describing novel objects. Experiments on three large-scale datasets demonstrate the ability of our method to describe novel concepts.
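The two-mode decoding idea in the abstract can be illustrated with a minimal sketch: at each step, a binary switch either predicts a word from the closed training vocabulary (standard LSTM mode) or attends over the key-value object memory and copies out the proxy visual word of the best-matching detected object. All names, shapes, and the sigmoid switch scorer below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def switchable_step(h, W_vocab, mem_keys, mem_words, w_switch):
    """One decoding step of a hypothetical two-mode decoder.

    h         : LSTM hidden state at this step, shape (d,)
    W_vocab   : projection onto the closed vocabulary, shape (V, d)
    mem_keys  : visual keys of detected objects, shape (n, d)
    mem_words : proxy visual words (memory values), length-n list
    w_switch  : scorer for the binary generate/retrieve switch, shape (d,)
    """
    # Mode decision: probability of retrieving from the object memory.
    p_retrieve = 1.0 / (1.0 + np.exp(-(w_switch @ h)))
    if p_retrieve > 0.5:
        # Mode 2: attend over the key-value memory and copy a word.
        scores = softmax(mem_keys @ h)
        return mem_words[int(np.argmax(scores))]
    # Mode 1: standard word prediction over the training vocabulary.
    probs = softmax(W_vocab @ h)
    return int(np.argmax(probs))
```

Because the memory word is copied rather than generated, the decoder's output layer never needs the novel word in its vocabulary, which is what disentangles language generation from the training objects.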
Pages: 1162-1173
Number of Pages: 12