WINNER: Weakly-supervised hIerarchical decompositioN and aligNment for spatio-tEmporal video gRounding

Cited by: 13
Authors
Li, Mengze [1]
Wang, Han [1]
Zhang, Wenqiao [2]
Miao, Jiaxu [1]
Zhao, Zhou [1, 3, 4]
Zhang, Shengyu [1]
Ji, Wei [2]
Wu, Fei [1, 3, 4]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
[3] Zhejiang Univ, Shanghai Inst Adv Study, Hangzhou, Peoples R China
[4] Shanghai AI Lab, Shanghai, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR52729.2023.02211
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Spatio-temporal video grounding aims to localize the aligned visual tube corresponding to a language query. Existing techniques achieve such alignment by exploiting dense boundary and bounding box annotations, which can be prohibitively expensive. To bridge the gap, we investigate the weakly-supervised setting, where models learn from easily accessible video-language data without annotations. We identify that intra-sample spurious correlations among video-language components can be alleviated if the model captures the decomposed structures of video and language data. In this light, we propose a novel framework, namely WINNER, for hierarchical video-text understanding. WINNER first builds the language decomposition tree in a bottom-up manner, upon which the structural attention mechanism and top-down feature backtracking jointly build a multi-modal decomposition tree, permitting a hierarchical understanding of unstructured videos. The multi-modal decomposition tree serves as the basis for multi-hierarchy language-tube matching. A hierarchical contrastive learning objective is proposed to learn the multi-hierarchy correspondence and distinguishment with intra-sample and inter-sample video-text decomposition structures, achieving video-language decomposition structure alignment. Extensive experiments demonstrate the rationality of our design and its effectiveness beyond state-of-the-art weakly supervised methods, even some supervised methods.
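The abstract names a hierarchical contrastive learning objective but does not give its form. As a rough illustration only, the sketch below assumes an InfoNCE-style loss computed separately at each hierarchy level and then averaged. The function names, the temperature value, and the toy features are all hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(text_feats, tube_feats, temperature=0.1):
    """InfoNCE-style loss: text i is the positive for tube i,
    and every other tube in the batch acts as a negative."""
    losses = []
    for i, t in enumerate(text_feats):
        logits = [dot(t, v) / temperature for v in tube_feats]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def hierarchical_contrastive_loss(levels, temperature=0.1):
    """Average the per-level loss over all hierarchy levels.
    `levels` is a list of (text_feats, tube_feats) pairs, one per level
    (an assumed layout, e.g. sentence level and phrase level)."""
    return sum(info_nce(t, v, temperature) for t, v in levels) / len(levels)


# Toy data: two levels, two samples each, with perfectly aligned features.
aligned = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
]
# Same data with the tubes swapped, so every text is paired with the wrong tube.
shuffled = [(t, v[::-1]) for t, v in aligned]

aligned_loss = hierarchical_contrastive_loss(aligned)
shuffled_loss = hierarchical_contrastive_loss(shuffled)
```

In this toy setup the correctly paired features give a much smaller loss than the swapped ones, which is the behaviour a contrastive alignment objective is meant to reward.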
Pages: 23090-23099
Page count: 10