LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval

Cited by: 45

Authors:

Tan, Reuben ^{[1]}

Xu, Huijuan ^{[2]}

Saenko, Kate ^{[1,3]}

Plummer, Bryan A. ^{[1]}

Affiliations:

[1] Boston Univ, Boston, MA 02215 USA

[2] Univ Calif Berkeley, Berkeley, CA 94720 USA

[3] MIT IBM Watson AI Lab, Cambridge, MA USA

Source:

2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021 | 2021

Keywords:

LOCALIZATION;

DOI:

10.1109/WACV48630.2021.00213

Chinese Library Classification (CLC):

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to a description without access to temporal annotations during training. Prior work uses co-attention mechanisms to understand relationships between the vision and language data, but they lack contextual information between video frames that can be useful to determine how well a segment relates to the query. To address this, we propose an efficient Latent Graph Co-Attention Network (LoGAN) that exploits fine-grained frame-by-word interactions to jointly reason about the correspondences between all possible pairs of frames, providing context cues absent in prior work. Experiments on the DiDeMo and Charades-STA datasets demonstrate the effectiveness of our approach, where we improve Recall@1 by 5-20% over prior weakly-supervised methods, even boasting an 11% gain over strongly-supervised methods on DiDeMo, while also using significantly fewer model parameters than other co-attention mechanisms.


Pages: 2082-2091

Page count: 10
