Moment Retrieval via Cross-Modal Interaction Networks With Query Reconstruction

Cited by: 33
Authors
Lin, Zhijie [1 ]
Zhao, Zhou [1 ,2 ]
Zhang, Zhu [1 ]
Zhang, Zijian [1 ]
Cai, Deng [3 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci, Hangzhou 310027, Peoples R China
[2] Alibaba Zhejiang Univ, Joint Res Inst Frontier Technol, Hangzhou 310058, Peoples R China
[3] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Peoples R China
Funding
National Natural Science Foundation of China; Zhejiang Provincial Natural Science Foundation
Keywords
Moment retrieval; syntactic GCN; multi-head self-attention; multi-stage cross-modal interaction; query reconstruction
DOI
10.1109/TIP.2020.2965987
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Moment retrieval aims to localize the moment in an untrimmed video that is most relevant to a given natural language query. Existing works often focus on only one aspect of this emerging task, such as query representation learning, video context modeling, or multi-modal fusion, and thus fail to develop a comprehensive system for further performance improvement. In this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) that considers multiple crucial factors for this challenging task: the syntactic dependencies of natural language queries, long-range semantic dependencies in the video context, and sufficient cross-modal interaction. Specifically, we devise a syntactic GCN that leverages the syntactic structure of queries for fine-grained representation learning, and we propose a multi-head self-attention mechanism to capture long-range semantic dependencies in the video context. Next, we employ multi-stage cross-modal interaction to explore the potential relations between video and query contents, and we additionally treat query reconstruction from the cross-modal representations of the target moment as an auxiliary task to strengthen those representations. Extensive experiments on ActivityNet Captions and TACoS demonstrate the effectiveness of the proposed method.
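The abstract names two encoder ideas that are easy to picture in code. Below is a minimal PyTorch sketch, not the authors' implementation: a single graph-convolution layer over a dependency-parse adjacency matrix (a "syntactic GCN") and multi-head self-attention over clip features. The class names SyntacticGCNLayer and VideoSelfAttention, the layer sizes, and the chain-shaped toy parse graph are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SyntacticGCNLayer(nn.Module):
    """One graph-convolution layer over a dependency-parse graph.

    Each word aggregates features from its syntactic neighbours
    (head/child edges plus a self-loop), so the query representation
    reflects grammatical structure rather than linear word order.
    """
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, words, adj):
        # words: (batch, num_words, dim)
        # adj:   (batch, num_words, num_words) 0/1 dependency adjacency,
        #        assumed here to already include self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)       # node degrees
        neighbours = torch.bmm(adj, self.linear(words)) / deg    # mean over neighbours
        return F.relu(neighbours)

class VideoSelfAttention(nn.Module):
    """Multi-head self-attention over clip features, capturing
    long-range semantic dependencies in the video context."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, clips):
        # clips: (batch, num_clips, dim)
        out, _ = self.attn(clips, clips, clips)
        return out

# Toy usage with random features and a chain-shaped parse graph.
batch, n_words, n_clips, dim = 2, 6, 20, 256
words = torch.randn(batch, n_words, dim)
adj = torch.eye(n_words).unsqueeze(0).repeat(batch, 1, 1)
adj[:, torch.arange(n_words - 1), torch.arange(1, n_words)] = 1.0  # head -> child edges
query_repr = SyntacticGCNLayer(dim)(words, adj)
video_repr = VideoSelfAttention(dim)(torch.randn(batch, n_clips, dim))
print(query_repr.shape, video_repr.shape)  # (2, 6, 256) (2, 20, 256)

In the paper's full pipeline, these two representations would then feed the multi-stage cross-modal interaction and the query-reconstruction auxiliary loss, both omitted from this sketch.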
Pages: 3750-3762
Page count: 13