WINNER: Weakly-supervised hIerarchical decompositioN and aligNment for spatio-tEmporal video gRounding

Cited by: 13

Authors:

Li, Mengze [1]

Wang, Han [1]

Zhang, Wenqiao [2]

Miao, Jiaxu [1]

Zhao, Zhou [1,3,4]

Zhang, Shengyu [1]

Ji, Wei [2]

Wu, Fei [1,3,4]

Affiliations:

[1] Zhejiang Univ, Hangzhou, Peoples R China

[2] Natl Univ Singapore, Singapore, Singapore

[3] Zhejiang Univ, Shanghai Inst Adv Study, Hangzhou, Peoples R China

[4] Shanghai AI Lab, Shanghai, Peoples R China

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

DOI:

10.1109/CVPR52729.2023.02211

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Spatio-temporal video grounding aims to localize the aligned visual tube corresponding to a language query. Existing techniques achieve such alignment by exploiting dense boundary and bounding box annotations, which can be prohibitively expensive. To bridge the gap, we investigate the weakly-supervised setting, where models learn from easily accessible video-language data without annotations. We identify that intra-sample spurious correlations among video-language components can be alleviated if the model captures the decomposed structures of video and language data. In this light, we propose a novel framework, namely WINNER, for hierarchical video-text understanding. WINNER first builds the language decomposition tree in a bottom-up manner, upon which the structural attention mechanism and top-down feature backtracking jointly build a multi-modal decomposition tree, permitting a hierarchical understanding of unstructured videos. The multi-modal decomposition tree serves as the basis for multi-hierarchy language-tube matching. A hierarchical contrastive learning objective is proposed to learn the multi-hierarchy correspondence and distinguishment with intra-sample and inter-sample video-text decomposition structures, achieving video-language decomposition structure alignment. Extensive experiments demonstrate the rationality of our design and its effectiveness beyond state-of-the-art weakly supervised methods, even some supervised methods.

Pages: 23090-23099

Page count: 10
