Exploring Language Hierarchy for Video Grounding

Cited by: 13
Authors
Ding, Xinpeng [1 ,2 ]
Wang, Nannan [3 ]
Zhang, Shiwei [4 ]
Huang, Ziyuan [5 ]
Li, Xiaomeng [2 ,6 ]
Tang, Mingqian [4 ]
Liu, Tongliang [7 ]
Gao, Xinbo [8 ]
Affiliations
[1] Xidian Univ, Sch Elect Engn, State Key Lab Integrated Serv Networks, Xian 710071, Shaanxi, Peoples R China
[2] Hong Kong Univ Sci & Technol, Dept Elect & Comp Engn, Hong Kong, Peoples R China
[3] Xidian Univ, Sch Telecommun Engn, State Key Lab Integrated Serv Networks, Xian 710071, Shaanxi, Peoples R China
[4] Alibaba Grp, Hangzhou 311100, Zhejiang, Peoples R China
[5] Natl Univ Singapore, Adv Robot Ctr, Singapore 117543, Singapore
[6] Hong Kong Univ Sci & Technol, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[7] Univ Sydney, Sch Comp Sci, Trustworthy Machine Learning Lab, Fac Engn, Sydney, NSW 2006, Australia
[8] Chongqing Univ Posts & Telecommun, Chongqing Key Lab Image Cognit, Chongqing 400065, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video and language; video understanding; language hierarchy; localization; proposal;
DOI
10.1109/TIP.2022.3187288
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
The understanding of language plays a key role in video grounding, where a target moment is localized according to a text query. From a biological point of view, language is naturally hierarchical: the main clause (predicate phrase) provides coarse semantics, while modifiers provide detailed descriptions. In video grounding, the moment described by the main clause may appear in multiple clips of a long video, including both the ground-truth clip and background clips. As a consequence, a model trained to discriminate the ground-truth clip from the background ones tends to neglect the main clause and to concentrate on the modifiers, which carry the discriminative information that distinguishes the target proposal from the others. We first demonstrate this phenomenon empirically, and then propose a Hierarchical Language Network (HLN) that exploits the language hierarchy, together with a new learning approach, Multi-Instance Positive-Unlabelled Learning (MI-PUL), to alleviate the problem. Specifically, HLN performs localization on multiple layers of the language hierarchy, so that attention is paid to different parts of the sentence rather than only to the discriminative ones. Furthermore, MI-PUL allows the model to localize background clips that could plausibly be described by the main clause, even without manual annotations. Together, the two proposed components enhance the learning of the main clause, which is of critical importance in video grounding. Finally, we show that the proposed HLN can be plugged into existing methods to improve their performance. Extensive experiments on challenging datasets show that HLN significantly improves state-of-the-art methods, in particular achieving a 6.15% gain in Recall@1, IoU=0.5 on the TACoS dataset.
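The MI-PUL idea in the abstract — treating the annotated clip as positive and the remaining clips as unlabelled, since some of them may also match the main clause — can be sketched with a non-negative positive-unlabelled risk over proposal scores. This is an illustrative sketch only: the function names, the binary-cross-entropy surrogate, and the class prior `prior` are assumptions, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(logits, target):
    # Binary cross-entropy of sigmoid(logits) against a scalar target (0 or 1).
    p = sigmoid(logits)
    return float(-(target * np.log(p + 1e-8)
                   + (1.0 - target) * np.log(1.0 - p + 1e-8)).mean())

def mi_pu_loss(scores, pos_mask, prior=0.25):
    """Non-negative PU risk over clip proposal scores (illustrative).

    scores:   (N,) logits, one per candidate clip.
    pos_mask: (N,) bool, True for the annotated ground-truth clip(s).
    prior:    assumed fraction of unlabelled clips that are actually positive.
    """
    pos = scores[pos_mask]       # labelled positives
    unl = scores[~pos_mask]      # unlabelled clips (may hide positives)
    risk_pos = bce(pos, 1.0)                 # positives scored as positive
    risk_pos_as_neg = bce(pos, 0.0)          # correction term
    risk_unl = bce(unl, 0.0)                 # unlabelled scored as negative
    # Clamp the estimated negative risk at zero (non-negative PU estimator).
    neg_risk = max(risk_unl - prior * risk_pos_as_neg, 0.0)
    return prior * risk_pos + neg_risk

# Toy usage: one annotated clip, three unlabelled clips.
scores = np.array([2.0, -1.0, 0.5, -2.0])
pos_mask = np.array([True, False, False, False])
loss = mi_pu_loss(scores, pos_mask, prior=0.25)
```

Unlike a plain positive/negative loss, the clamp keeps unlabelled clips that resemble the main clause from being pushed hard toward the negative class.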
Pages: 4693-4706
Page count: 14
Related Papers
80 in total
[71] Yuan Y., 2019, Proc. NeurIPS, p. 536
[72] Yuan Y. T., 2019, Proc. AAAI Conf. Artif. Intell., p. 9159
[73] Zeng R., Xu H., Huang W., Chen P., Tan M., Gan C., "Dense Regression Network for Video Grounding," Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10284-10293
[74] Zhang B., 2008, Proc. Int. Symposiums, vol. 27, p. 703
[75] Zhang H., 2022, arXiv:2201.08071
[76] Zhang S., 2019, arXiv (Computer Vision)
[77] Zhang Z., Lin Z., Zhao Z., Xiao Z., "Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos," Proc. 42nd Int. ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '19), 2019, pp. 655-664
[78] Zhou H., Zhang C., Luo Y., Chen Y., Hu C., "Embracing Uncertainty: Decoupling and De-bias for Robust Temporal Grounding," Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8441-8450
[79] Zhou Z. H., 2004, "Multi-Instance Learning: A Survey"
[80] Zhu F., Zhu Y., Chang X., Liang X., "Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks," Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10009-10019