Dense Regression Network for Video Grounding

Cited by: 194
Authors
Zeng, Runhao [1, 3]
Xu, Haoming [1]
Huang, Wenbing [4]
Chen, Peihao [1]
Tan, Mingkui [1]
Gan, Chuang [2]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
[2] MIT IBM Watson AI Lab, Cambridge, MA USA
[3] Peng Cheng Lab, Shenzhen, Peoples R China
[4] Tsinghua Univ, Dept Comp Sci & Technol, Beijing Natl Res Ctr Informat Sci & Technol BNRis, Beijing, Peoples R China
Source
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020) | 2020
Funding
National Natural Science Foundation of China;
DOI
10.1109/CVPR42600.2020.01030
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We address the problem of video grounding from natural language queries. The key challenge in this task is that one training video might only contain a few annotated starting/ending frames that can be used as positive examples for model training. Most conventional approaches directly train a binary classifier using such imbalanced data, thus achieving inferior results. The key idea of this paper is to use the distances between each frame within the ground-truth segment and the starting (ending) frame as dense supervision to improve the video grounding accuracy. Specifically, we design a novel dense regression network (DRN) to regress the distances from each frame to the starting (ending) frame of the video segment described by the query. We also propose a simple but effective IoU regression head module to explicitly consider the localization quality of the grounding results (i.e., the IoU between the predicted location and the ground truth). Experimental results show that our approach significantly outperforms state-of-the-art methods on three datasets (i.e., Charades-STA, ActivityNet-Captions, and TACoS).
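The two quantities named in the abstract, the per-frame distances to the segment boundaries and the IoU between a predicted segment and the ground truth, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `dense_distance_targets` and `temporal_iou`, the use of integer frame indices, and the choice to return `None` for frames outside the ground-truth segment are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def dense_distance_targets(
    num_frames: int, start: int, end: int
) -> List[Optional[Tuple[int, int]]]:
    """Hypothetical helper: per-frame regression targets.

    For every frame t inside the ground-truth segment [start, end],
    the target is (t - start, end - t), i.e. the distances to the
    starting and ending frames. Frames outside the segment get None
    (an assumption: they provide no regression target here).
    """
    targets: List[Optional[Tuple[int, int]]] = []
    for t in range(num_frames):
        if start <= t <= end:
            targets.append((t - start, end - t))
        else:
            targets.append(None)
    return targets


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection over union of two 1-D temporal segments (start, end)."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # A 6-frame video whose described segment spans frames 2..4.
    print(dense_distance_targets(6, 2, 4))
    # Overlap of a predicted segment with the ground truth.
    print(temporal_iou((1.0, 4.0), (2.0, 5.0)))
```

Under these assumptions, every frame inside the annotated segment yields a training target, rather than only the two boundary frames, which is the sense in which the supervision is dense.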
Pages: 10284-10293
Number of pages: 10