Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval

Cited by: 44

Authors:

Wang, Zheng [1]

Chen, Jingjing [1]

Jiang, Yu-Gang [1]

Affiliations:

[1] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Sch Comp Sci, Shanghai, Peoples R China

Source:

PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021

Keywords:

video retrieval; cross-modal interaction; noise contrastive learning; weakly-supervised; LOCALIZATION; LANGUAGE;

DOI:

10.1145/3474085.3475278

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video moment retrieval aims to localize the video moment most relevant to a given text query. Weakly supervised approaches train on video-text pairs only, without temporal annotations. Most current methods align the proposed video moment and the text in a joint embedding space. However, without temporal annotations, the semantic gap between the two modalities leads most methods to focus predominantly on learning the joint feature representation, with less emphasis on learning the visual feature representation. This paper aims to improve the visual feature representation with supervision in the visual domain, obtaining discriminative visual features for cross-modal learning. The approach is based on the observation that relevant video moments (i.e., moments that share similar activities) from different videos are commonly described by similar sentences; hence, the visual features of these relevant video moments should also be similar even though they come from different videos. Therefore, to obtain more discriminative and robust visual features for video moment retrieval, we propose to align the visual features of relevant video moments from different videos that co-occur in the same training batch. In addition, a contrastive learning approach is introduced for learning the moment-level alignment of these videos. Through extensive experiments, we demonstrate that the proposed visual co-occurrence alignment learning method outperforms the cross-modal alignment learning counterpart and achieves promising results for video moment retrieval.
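The batch-level visual alignment described in the abstract can be illustrated with a generic InfoNCE-style contrastive loss. The following is only a minimal sketch under stated assumptions, not the formulation used in the paper: the function names, the `positive_index` pairing (assumed here to come from some measure of query similarity), and the `temperature` value are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def co_occurrence_nce_loss(moment_feats, positive_index, temperature=0.1):
    """Generic InfoNCE-style loss over one training batch (illustrative only).

    moment_feats[i]   : visual feature of the moment proposed for video i.
    positive_index[i] : index (!= i) of the batch item assumed to be relevant
                        to item i; how this pairing is chosen is an assumption.
    All other batch items act as negatives for item i.
    """
    losses = []
    for i, anchor in enumerate(moment_feats):
        # Scaled similarity of the anchor to every other moment in the batch.
        logits = {
            j: cosine(anchor, other) / temperature
            for j, other in enumerate(moment_feats)
            if j != i
        }
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits.values())
        log_denominator = max_logit + math.log(
            sum(math.exp(s - max_logit) for s in logits.values())
        )
        # Negative log-probability of picking the assumed positive.
        losses.append(log_denominator - logits[positive_index[i]])
    return sum(losses) / len(losses)


# Toy batch: items 0 and 1 look alike, items 2 and 3 look alike.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = co_occurrence_nce_loss(feats, positive_index=[1, 0, 3, 2])
misaligned_loss = co_occurrence_nce_loss(feats, positive_index=[2, 3, 0, 1])
```

In this toy batch the loss is small when each moment is paired with its visually similar counterpart and large when the pairing is swapped, which is the behaviour a contrastive alignment objective is meant to encourage.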


Pages: 1459-1468

Page count: 10
