Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval

Cited by: 44
Authors
Wang, Zheng [1 ]
Chen, Jingjing [1 ]
Jiang, Yu-Gang [1 ]
Affiliations
[1] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Sch Comp Sci, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021
Keywords
video retrieval; cross-modal interaction; noise contrastive learning; weakly-supervised; LOCALIZATION; LANGUAGE;
DOI
10.1145/3474085.3475278
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video moment retrieval aims to localize the video moment most relevant to a given text query. Weakly supervised approaches use only video-text pairs for training, without temporal annotations. Most current methods align proposed video moments and the text in a joint embedding space. However, lacking temporal annotations, the semantic gap between the two modalities leads most methods to focus predominantly on learning the joint feature representation, with less emphasis on learning the visual feature representation. This paper aims to improve the visual feature representation with supervision in the visual domain, obtaining discriminative visual features for cross-modal learning. It builds on the observation that relevant video moments (i.e., moments sharing similar activities) from different videos are commonly described by similar sentences; the visual features of these relevant moments should therefore also be similar, even though they come from different videos. Accordingly, to obtain more discriminative and robust visual features for video moment retrieval, we propose to align the visual features of relevant video moments from different videos that co-occur in the same training batch. In addition, a contrastive learning approach is introduced to learn the moment-level alignment of these videos. Through extensive experiments, we demonstrate that the proposed visual co-occurrence alignment learning method outperforms its cross-modal alignment learning counterpart and achieves promising results for video moment retrieval.
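The moment-level contrastive alignment described in the abstract can be illustrated with a generic InfoNCE-style loss. This is a hedged sketch, not the paper's exact formulation: the function name `co_occurrence_alignment_loss`, the batch layout, and the temperature value are assumptions for illustration. Each anchor moment feature is pulled toward the co-occurring relevant moment from another video (its positive) and pushed away from the other moments in the batch (negatives).

```python
import numpy as np

def co_occurrence_alignment_loss(anchors, positives, temperature=0.1):
    """Illustrative InfoNCE-style contrastive loss (hypothetical sketch,
    not the authors' exact objective). Row i of `anchors` and row i of
    `positives` are visual features of relevant moments from two different
    videos in the same batch; all other rows serve as negatives."""
    # L2-normalize so the dot product is cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature              # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    # Diagonal entries correspond to the positive (co-occurring) pairs.
    loss = -np.log(np.diag(exp) / exp.sum(axis=1))
    return loss.mean()

# Toy batch: 4 moments with 8-dim features. Using the anchors themselves
# as positives (perfectly aligned moments) should give a low loss.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
print(co_occurrence_alignment_loss(feats, feats))
```

When the positives are unrelated random features, the loss is noticeably higher than for well-aligned pairs, which is the gradient signal that pulls co-occurring moments together in feature space.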
Pages: 1459 / 1468
Page count: 10