Movie fill in the blank by joint learning from video and text with adaptive temporal attention

Cited by: 8
Authors
Chen, Jie [1, 2]
Shao, Jie [1, 2]
He, Chengkun [1, 2]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video question answering; Adaptive temporal attention; Text information fusion;
DOI
10.1016/j.patrec.2018.06.030
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video understanding is a challenging problem that attracts a lot of research attention. Recently, a new task called movie fill-in-the-blank (MovieFIB) was proposed. In this task, given a movie clip and a description containing one blank, the goal is to predict the word in the blank accurately. Previous studies have made many contributions to this problem. However, some of them do not utilize the relationship between words and video frames, while others treat visual information as the essential element for predicting the blank word and fail to distinguish the effects of the text before and after the blank. To overcome these limitations, in this paper we propose to use adaptive temporal attention and to fuse text information with attention. We first extract video and word features. Then, adaptive temporal attention is used to update the original description. For the updated description, we extract its text information. An attention mechanism is applied to fuse the text information. Finally, we use adaptive temporal attention to predict the blank word. Extensive experiments demonstrate that our model achieves satisfactory performance. (c) 2018 Elsevier B.V. All rights reserved.
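The abstract does not give the model's equations, so the following is only an illustrative sketch of the general idea of temporal attention over frame features with a gate that balances visual and text information. The scaled dot-product scoring, the sigmoid gate, and all names (`temporal_attention`, `adaptive_fuse`, `gate_logit`) are assumptions made for illustration, not the authors' formulation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def temporal_attention(query, frames):
    """Soft attention over frame features.

    Assumption: relevance is scored by a scaled dot product between a
    text-derived query vector and each frame feature vector.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in frames])
    # Context vector: attention-weighted sum of the frame features.
    context = [sum(w * f[i] for w, f in zip(weights, frames))
               for i in range(len(query))]
    return weights, context

def adaptive_fuse(query, frames, text_state, gate_logit):
    """Hypothetical 'adaptive' step: a sigmoid gate beta in (0, 1) decides
    how much to rely on the visual context versus the text state."""
    weights, context = temporal_attention(query, frames)
    beta = 1.0 / (1.0 + math.exp(-gate_logit))
    fused = [beta * c + (1.0 - beta) * t for c, t in zip(context, text_state)]
    return weights, context, fused

# Toy inputs: three 2-dimensional frame features and a 2-dimensional query.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
text_state = [0.0, 0.0]
weights, context, fused = adaptive_fuse(query, frames, text_state, gate_logit=0.0)
```

With `gate_logit=0.0` the gate is 0.5, so the fused vector is the midpoint between the visual context and the text state; a large positive value would lean almost entirely on the video, and a large negative value on the text.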
Pages: 62-68
Number of pages: 7