Video Activity Localisation with Uncertainties in Temporal Boundary

Cited by: 14
Authors
Huang, Jiabo [1,4]
Jin, Hailin [2 ]
Gong, Shaogang [1 ]
Liu, Yang [3,5]
Affiliations
[1] Queen Mary Univ London, London, England
[2] Adobe Res, San Francisco, CA USA
[3] Peking Univ, Wangxuan Inst Comp Technol, Beijing, Peoples R China
[4] Vis Semant Ltd, London, England
[5] Beijing Inst Gen Artificial Intelligence, Beijing, Peoples R China
Source
COMPUTER VISION, ECCV 2022, PT XXXIV | 2022 / Vol. 13694
DOI
10.1007/978-3-031-19830-4_41
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Current methods for video activity localisation over time implicitly assume that the activity temporal boundaries labelled for model training are fixed and precise. However, in unscripted natural videos, different activities mostly transition smoothly, so it is intrinsically ambiguous to label precisely when an activity starts and ends. Such uncertainties in temporal labelling are currently ignored in model training, resulting in models learning mismatched video-text correlations that generalise poorly at test time. In this work, we solve this problem by introducing Elastic Moment Bounding (EMB) to accommodate flexible and adaptive activity temporal boundaries, modelling universally interpretable video-text correlation with tolerance to the temporal uncertainties underlying pre-fixed annotations. Specifically, we construct elastic boundaries adaptively by mining frame-wise temporal endpoints that maximise the alignment between video segments and query sentences. To enable both more accurate matching (segment content attention) and more robust localisation (segment elastic boundaries), we optimise the selection of frame-wise endpoints subject to segment-wise content through a novel Guided Attention mechanism. Extensive experiments on three video activity localisation benchmarks demonstrate compellingly EMB's advantages over existing methods that do not model uncertainty.
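
As a concrete illustration of the elastic boundary search described in the abstract, the following PyTorch sketch refines a pair of annotated endpoints within an elastic window so that the selected segment maximises an inside-versus-outside frame-query alignment contrast. This is a minimal sketch under stated assumptions, not the authors' implementation: every name here (elastic_moment_bounding, slack, the cosine alignment score) is hypothetical, and the simple contrast objective is only a crude stand-in for the paper's Guided Attention mechanism, which couples endpoint selection with segment content attention.

import torch
import torch.nn.functional as F


def elastic_moment_bounding(frame_feats, query_feat, anno_start, anno_end, slack=5):
    """Refine annotated endpoints within an elastic window (illustrative).

    frame_feats: (T, D) per-frame video features.
    query_feat:  (D,) sentence (query) feature.
    anno_start, anno_end: annotated frame indices (possibly imprecise).
    slack: how far an elastic boundary may drift from the annotation.
    """
    T = frame_feats.size(0)
    # Frame-wise alignment scores between every frame and the query; these
    # play the role of the frame-query alignment the abstract mines.
    scores = F.cosine_similarity(frame_feats, query_feat.unsqueeze(0), dim=-1)  # (T,)

    # Candidate endpoints are confined to an elastic window around the
    # (uncertain) annotated boundaries rather than fixed to them.
    start_cands = range(max(0, anno_start - slack), min(T - 1, anno_start + slack) + 1)
    end_cands = range(max(0, anno_end - slack), min(T - 1, anno_end + slack) + 1)

    best_gain, best_span = -float("inf"), (anno_start, anno_end)
    for s in start_cands:
        for e in end_cands:
            if e <= s:
                continue
            inside = scores[s:e + 1].mean()  # reward query-aligned content inside
            rest = torch.cat([scores[:s], scores[e + 1:]])
            outside = rest.mean() if rest.numel() > 0 else torch.tensor(0.0)
            # Inside-vs-outside contrast: boundaries should separate content
            # that matches the query from content that does not.
            gain = (inside - outside).item()
            if gain > best_gain:
                best_gain, best_span = gain, (s, e)
    return best_span


# Toy usage with random features (purely illustrative).
torch.manual_seed(0)
frames = torch.randn(64, 256)   # 64 frames, 256-dim features
query = torch.randn(256)
print(elastic_moment_bounding(frames, query, anno_start=20, anno_end=35))

In the paper itself the elastic boundaries are learned jointly with the video-text matching model rather than found by exhaustive search; the sketch above only conveys the shape of the objective.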
Pages: 724-740
Page count: 17