Switchable Novel Object Captioner

Cited by: 25
Authors
Wu, Yu [1 ,2 ]
Jiang, Lu [3 ]
Yang, Yi [4 ]
Affiliations
[1] Baidu Res, Beijing 100000, Peoples R China
[2] Princeton Univ, Sch Comp Sci, Princeton, NJ 08540 USA
[3] Google Res, Mountain View, CA 94043 USA
[4] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310058, Zhejiang, Peoples R China
Keywords
Image captioning; novel object captioning; zero-shot learning
DOI
10.1109/TPAMI.2022.3144984
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Image captioning aims to automatically describe images with sentences and typically requires large amounts of paired image-sentence data for training. However, trained captioning models can hardly be applied to new domains that contain novel words. In this paper, we introduce the zero-shot novel object captioning task, in which the machine generates descriptions of novel objects without any extra training sentences. To tackle this challenging task, we mimic the way babies talk about something unknown, i.e., by using the word for a similar known object. Following this motivation, we build a key-value object memory from detection models, containing the visual information and corresponding words for the objects in an image. For novel objects, we use the words of the most similar seen objects as proxy visual words to resolve the out-of-vocabulary issue. We then propose a Switchable LSTM that incorporates knowledge from the object memory into sentence generation. The model has two switchable working modes: 1) generating the sentence like a standard LSTM, and 2) retrieving proper nouns from the key-value memory. Our model thus learns to fully disentangle language generation from the training objects and requires zero training sentences to describe novel objects. Experiments on three large-scale datasets demonstrate the ability of our method to describe novel concepts.
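The two-mode decoding described in the abstract can be summarized in a short sketch. The following PyTorch code is an illustrative approximation based only on the abstract, not the authors' released implementation; names such as SwitchableDecoder, obj_keys, and obj_words, and all dimensions, are assumptions made for clarity. At each step a switch head decides between generating a word from the vocabulary and retrieving an object word from the key-value memory via attention over detected-object features.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableDecoder(nn.Module):
    # Illustrative two-mode decoder: generate from the vocabulary (mode 1)
    # or retrieve an object word from a key-value object memory (mode 2).
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.word_head = nn.Linear(hidden_dim, vocab_size)   # mode 1: standard word generation
        self.switch_head = nn.Linear(hidden_dim, 2)          # decides generate vs. retrieve
        self.key_proj = nn.Linear(feat_dim, hidden_dim)      # projects object features into key space

    def step(self, prev_word, state, obj_keys, obj_words):
        # prev_word: (B,) previous word indices; a retrieved novel object is fed back
        #            through its proxy seen word, so the embedding table never needs
        #            entries for novel words.
        # obj_keys:  (B, N, feat_dim) detected-object visual features (memory keys).
        # obj_words: (B, N) word indices of the detected objects (memory values).
        h, c = self.lstm(self.embed(prev_word), state)

        switch = F.softmax(self.switch_head(h), dim=-1)       # [P(generate), P(retrieve)]
        gen_logits = self.word_head(h)                        # mode 1: vocabulary distribution

        keys = self.key_proj(obj_keys)                        # (B, N, hidden_dim)
        attn = F.softmax(torch.bmm(keys, h.unsqueeze(2)).squeeze(2), dim=-1)  # mode 2: memory attention

        # word of the most attended memory slot; may be a novel word
        retrieved = obj_words.gather(1, attn.argmax(dim=1, keepdim=True)).squeeze(1)
        return gen_logits, attn, switch, retrieved, (h, c)

In this reading, if the retrieval probability wins at inference time, the retrieved object word (possibly novel) is emitted, and its proxy seen word is fed back as the next input, which is how the sketch keeps language generation decoupled from the novel vocabulary.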
Pages: 1162-1173
Number of pages: 12