Cross Interaction Network for Natural Language Guided Video Moment Retrieval

Cited by: 14
Authors
Yu, Xinli [1 ]
Malmir, Mohsen [2 ]
He, Xin [2 ]
Chen, Jiangning [2 ]
Wang, Tong [2 ]
Wu, Yue [2 ]
Liu, Yue [2 ]
Liu, Yang [2 ]
Affiliations
[1] Temple University, Philadelphia, PA 19122, USA
[2] Amazon.com, Boston, MA, USA
Source
SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL | 2021
Keywords
Information retrieval; Natural language guided; Video moment retrieval; Cross attention; Self attention
DOI
10.1145/3404835.3463021
Chinese Library Classification (CLC)
TP [Automation technology; computer technology]
Subject classification code
0812
Abstract
Natural language query grounding in videos is a challenging task that requires comprehensive understanding of the query and the video, as well as fusion of information across the two modalities. Existing methods mostly emphasize one-way query-to-video interaction with a late fusion scheme and lack effective ways to capture the relationships within and between the query and the video at a fine-grained level. Moreover, current methods are often overly complicated, resulting in long training times. We propose a self-attention mechanism together with a cross-interaction multi-head attention mechanism in an early fusion scheme to capture intra-modal dependencies as well as inter-modal relations in both directions (query-to-video and video-to-query). The cross-attention method can associate query words and video frames at any position and account for long-range dependencies in the video context. In addition, we propose a multi-task training objective that combines start/end prediction with moment segmentation. The moment segmentation task provides additional training signals that remedy start/end prediction noise caused by annotator disagreement. Our simple yet effective architecture enables fast training (within 1 hour on an AWS P3.2xlarge GPU instance) and near-instant inference. We show that the proposed method achieves superior performance compared to more complex state-of-the-art methods, in particular surpassing the SOTA on the high-IoU metric (R@1, IoU=0.7) by 3.52% absolute (11.09% relative) on the Charades-STA dataset.
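To make the described architecture concrete, below is a minimal PyTorch sketch of the cross-interaction idea. This record includes no code, so every module name, dimension, and wiring choice here is an assumption for illustration, not the authors' implementation: self-attention captures intra-video and intra-query dependencies, two multi-head attention modules implement the query-to-video and video-to-query directions, and a multi-task loss combines start/end prediction with per-frame moment segmentation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossInteractionBlock(nn.Module):
    # Hypothetical module: self-attention within each modality, followed by
    # cross attention in both directions (query-to-video and video-to-query).
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.video_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.query_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.q2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2q = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video, query):
        # video: (B, T, dim) frame features; query: (B, L, dim) word features.
        video, _ = self.video_self(video, video, video)  # intra-video dependencies
        query, _ = self.query_self(query, query, query)  # intra-query dependencies
        v_out, _ = self.q2v(video, query, query)         # each frame attends to words
        q_out, _ = self.v2q(query, video, video)         # each word attends to frames
        return v_out, q_out

def multi_task_loss(start_logits, end_logits, seg_logits, start_idx, end_idx, seg_mask):
    # Assumed form of the multi-task objective: cross-entropy over frame
    # positions for the start/end boundaries, plus a binary per-frame
    # "inside the moment" segmentation term.
    return (F.cross_entropy(start_logits, start_idx)
            + F.cross_entropy(end_logits, end_idx)
            + F.binary_cross_entropy_with_logits(seg_logits, seg_mask))

Under this reading, the segmentation term supervises every frame inside the annotated span rather than only its two endpoints, which is the plausible source of the extra training signal against noisy start/end annotations.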
Pages: 1860-1864
Page count: 5