Sequential Transformer via an Outside-In Attention for image captioning

Cited by: 23
Authors
Wei, Yiwei [1 ,2 ]
Wu, Chunlei [3 ]
Li, Guohe [1 ]
Shi, Haitao [1 ]
Affiliations
[1] China Univ Petr Beijing Karamay, Sch Petr Engn, Karamay, Peoples R China
[2] China Univ Petr Beijing Karamay, Oil & Gas Big Data Integrated Lab, Karamay, Peoples R China
[3] China Univ Petr, Coll Comp Sci & Technol, Qingdao, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Self attention; Recurrent network; Transformer;
DOI
10.1016/j.engappai.2021.104574
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
Attention-based approaches have firmly established the state of the art in image captioning. However, both the recurrent attention in recurrent neural networks (RNNs) and the self-attention in the Transformer have limitations. Recurrent attention relies only on the external state to decide where to look, ignoring the internal relationships between image regions; self-attention is just the opposite. To fill this gap, we first introduce an Outside-in Attention that lets the external state participate in the interaction among image regions, prompting the model to learn the dependencies inside the image regions as well as the dependencies between the image regions and the external state. We then investigate a Sequential Transformer framework (S-Transformer) built on the original Transformer structure, whose decoder incorporates the Outside-in Attention and an RNN, allowing the model to inherit the advantages of both the Transformer and recurrent networks in sequence modeling. When tested on the COCO dataset, the proposed approaches achieve competitive results in single-model and ensemble configurations on both the MSCOCO Karpathy test split and the online test server.
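
The abstract only sketches the mechanism, so the following is a minimal, hypothetical PyTorch sketch of how an external decoder state could join the region-to-region interaction in the spirit of the Outside-in Attention described above. The class name OutsideInAttention, the concatenation-based fusion, and the tensor shapes are assumptions made for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class OutsideInAttention(nn.Module):
    """Illustrative sketch: regions attend to each other AND to an external state."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, regions: torch.Tensor, h_ext: torch.Tensor) -> torch.Tensor:
        # regions: (B, N, d) image-region features; h_ext: (B, d) external RNN state.
        # Appending the external state to the key/value set lets every region
        # interact with the other regions and with the decoding context at once.
        kv = torch.cat([regions, h_ext.unsqueeze(1)], dim=1)  # (B, N+1, d)
        out, _ = self.attn(query=regions, key=kv, value=kv)   # (B, N, d)
        return out

# Example usage with random tensors (batch of 2, 36 regions, 512-d features):
# attended = OutsideInAttention(512)(torch.randn(2, 36, 512), torch.randn(2, 512))
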
Pages: 8