LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval

Cited by: 45
Authors
Tan, Reuben [1]
Xu, Huijuan [2]
Saenko, Kate [1, 3]
Plummer, Bryan A. [1]
Affiliations
[1] Boston Univ, Boston, MA 02215 USA
[2] Univ Calif Berkeley, Berkeley, CA 94720 USA
[3] MIT IBM Watson AI Lab, Cambridge, MA USA
Source
2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2021) | 2021
Keywords
LOCALIZATION;
DOI
10.1109/WACV48630.2021.00213
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to a description without access to temporal annotations during training. Prior work uses co-attention mechanisms to understand relationships between the vision and language data, but they lack contextual information between video frames that can be useful to determine how well a segment relates to the query. To address this, we propose an efficient Latent Graph Co-Attention Network (LoGAN) that exploits fine-grained frame-by-word interactions to jointly reason about the correspondences between all possible pairs of frames, providing context cues absent in prior work. Experiments on the DiDeMo and Charades-STA datasets demonstrate the effectiveness of our approach, where we improve Recall@1 by 5-20% over prior weakly-supervised methods, even boasting an 11% gain over strongly-supervised methods on DiDeMo, while also using significantly fewer model parameters than other co-attention mechanisms.
Pages: 2082-2091
Page count: 10
Related papers
41 in total
  • [1] MUTAN: Multimodal Tucker Fusion for Visual Question Answering
    Ben-younes, Hedi
    Cadene, Remi
    Cord, Matthieu
    Thome, Nicolas
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 2631 - 2639
  • [2] MUREL: Multimodal Relational Reasoning for Visual Question Answering
    Cadene, Remi
    Ben-younes, Hedi
    Cord, Matthieu
    Thome, Nicolas
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 1989 - 1998
  • [3] Chen JY, 2018, 2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), P162
  • [4] Query-guided Regression Network with Context Policy for Phrase Grounding
    Chen, Kan
    Kovvuri, Rama
    Nevatia, Ram
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 824 - 832
  • [5] Chen SX, 2019, AAAI CONF ARTIF INTE, P8199
  • [6] Cho K., 2014, P C EMP METH NAT LAN, P1724, DOI DOI 10.3115/V1/D14-1179
  • [7] Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
  • [8] Learning Spatiotemporal Features with 3D Convolutional Networks
    Du Tran
    Bourdev, Lubomir
    Fergus, Rob
    Torresani, Lorenzo
    Paluri, Manohar
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 4489 - 4497
  • [9] Faghri F, 2018, P BRIT MACH VIS C
  • [10] TALL: Temporal Activity Localization via Language Query
    Gao, Jiyang
    Sun, Chen
    Yang, Zhenheng
    Nevatia, Ram
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 5277 - 5285