Semantic Video Moment Retrieval by Temporal Feature Perturbation and Refinement

Cited by: 2
Authors
Cai, Weitong [1 ]
Huang, Jiabo [1 ]
Hu, Jian [1 ]
Gong, Shaogang [1 ]
Jin, Hailin [2 ]
Liu, Yang [3 ]
Affiliations
[1] Queen Mary Univ London, London, England
[2] Adobe Res, San Jose, CA 95110 USA
[3] Peking Univ, WICT, Beijing, Peoples R China
Source
2024 14TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION SYSTEMS (ICPRS) | 2024
Keywords
Vision-Language Learning; Video Moment Retrieval; Temporal Feature Perturbation;
D O I
10.1109/ICPRS62101.2024.10677814
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video moment retrieval (VMR) aims to locate temporal activities in untrimmed videos from sentence queries, but suffers from a temporal bias problem: VMR models tend to over-rely on statistical regularities rather than cross-modal semantics, and perform poorly under shifted temporal distributions. Existing attempts clip or reorder video segments, or discard some samples, breaking sample integrity and wasting information; this is inappropriate given the limited size of VMR datasets, whose labeling is labor-intensive. In this work, without sacrificing samples' inherent value to balance performance, we develop a novel Temporal feature Perturbation and Refinement (TPR) method to augment each sample. Specifically, we perturb frame features by manipulating their time-level statistics, diversifying temporal distributions and promoting more generalizable cross-modal learning. Considering the plausible moment-boundary shifts introduced by perturbation, we further refine final predictions by augmenting time-point labels into candidate endpoint sets with designed query triplets. Experiments show TPR's superiority across various temporal distributions.
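The abstract describes perturbing frame features by manipulating their time-level statistics. The paper does not specify the exact operation here, so the following is only a minimal sketch of one plausible reading: normalize each feature channel by its temporal mean and standard deviation, then re-inject randomly perturbed statistics. The function name and the Gaussian noise model are hypothetical, not the authors' method.

```python
import numpy as np

def perturb_temporal_stats(frames: np.ndarray, alpha: float = 0.5,
                           seed: int = 0) -> np.ndarray:
    """Hypothetical time-level statistics perturbation for (T, D) frame features.

    Removes each channel's temporal mean/std, then re-scales with
    statistics jittered by Gaussian noise of relative strength `alpha`.
    """
    rng = np.random.default_rng(seed)
    mu = frames.mean(axis=0, keepdims=True)            # time-level mean, (1, D)
    sigma = frames.std(axis=0, keepdims=True) + 1e-6   # time-level std, (1, D)
    normed = (frames - mu) / sigma                     # strip temporal statistics
    # Sample perturbed statistics around the originals (assumed noise model).
    mu_new = mu * (1.0 + alpha * rng.standard_normal(mu.shape))
    sigma_new = sigma * (1.0 + alpha * rng.standard_normal(sigma.shape))
    return normed * np.abs(sigma_new) + mu_new         # re-inject perturbed stats

# Example: 8 frames with 4-dim features keep their shape but get a
# shifted temporal distribution.
feats = np.arange(32, dtype=np.float64).reshape(8, 4)
aug = perturb_temporal_stats(feats)
print(aug.shape)  # (8, 4)
```

Because only the per-channel statistics are altered, the relative ordering of frames within each channel is preserved, which matches the paper's motivation of diversifying temporal distributions without clipping or reordering segments.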
Pages: 7
References
28 records in total
[1] Bahdanau D., 2016, arXiv:1409.0473, DOI 10.48550/arXiv.1409.0473
[2] Bao P., 2022, ICMR
[3] Heilbron F.C., 2015, Proc. IEEE CVPR, p. 961, DOI 10.1109/CVPR.2015.7298698
[4] Cai W., 2022, BMVC
[5] Chen Y.W., 2021, Adv. Neural Inf. Process. Syst., V34
[6] Gao J., Sun C., Yang Z., Nevatia R., TALL: Temporal Activity Localization via Language Query, 2017 IEEE ICCV, pp. 5277-5285
[7] Gao S.-H., 2021, CVPR
[8] Hao J., Sun H., Ren P., Wang J., Qi Q., Liao J., Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding, Computer Vision, ECCV 2022, Pt. XXXVI, V13696, pp. 130-147
[9] Huang J., Jin H., Gong S., Liu Y., Video Activity Localisation with Uncertainties in Temporal Boundary, Computer Vision, ECCV 2022, Pt. XXXIV, V13694, pp. 724-740
[10] Huang J., Liu Y., Gong S., Jin H., Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation, 2021 IEEE/CVF ICCV, pp. 7179-7188