HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization

Cited by: 146
Authors
Zhao, Hang [1 ]
Torralba, Antonio [1 ]
Torresani, Lorenzo [2 ]
Yan, Zhicheng [3 ]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
[2] Dartmouth Coll, Hanover, NH 03755 USA
[3] Univ Illinois, Urbana, IL 61801 USA
Source
2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019) | 2019
DOI
10.1109/ICCV.2019.00876
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents a new large-scale dataset for recognition and temporal localization of human actions collected from Web videos. We refer to it as HACS (Human Action Clips and Segments). We leverage both consensus and disagreement among visual classifiers to automatically mine candidate short clips from unlabeled videos, which are subsequently validated by human annotators. The resulting dataset is dubbed HACS Clips. Through a separate process we also collect annotations defining action segment boundaries. This dataset is called HACS Segments. Overall, HACS Clips consists of 1.5M annotated clips sampled from 504K untrimmed videos, and HACS Segments contains 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories. HACS Clips contains more labeled examples than any existing video benchmark. This renders our dataset both a large-scale action recognition benchmark and an excellent source for spatiotemporal feature learning. In our transfer learning experiments on three target datasets, HACS Clips outperforms Kinetics-600, Moments-In-Time and Sports1M as a pretraining source. On HACS Segments, we evaluate state-of-the-art methods of action proposal generation and action localization, and highlight the new challenges posed by our dense temporal annotations.
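The abstract only sketches the clip-mining step, so the following is a minimal, hypothetical Python sketch of the general idea as stated there: clips on which an ensemble of visual classifiers agrees confidently are proposed as candidate positives, clips on which the classifiers strongly disagree are proposed as hard negatives, and both kinds are forwarded to human annotators. The function and parameter names (mine_candidates, pos_thresh, neg_spread) and the threshold values are assumptions for illustration, not the authors' pipeline.

# Minimal sketch (not the authors' released code) of mining candidate clips
# by classifier consensus vs. disagreement. Names and thresholds are
# illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

@dataclass
class Candidate:
    video_id: str
    start: float   # clip start time in seconds
    end: float     # clip end time in seconds
    label: str     # proposed action class
    kind: str      # "consensus" (likely positive) or "disagreement" (hard negative)

def mine_candidates(
    clips: Sequence[Dict],                              # {"video_id", "start", "end", "label"}
    classifiers: List[Callable[[Dict, str], float]],    # each returns P(label | clip)
    pos_thresh: float = 0.8,                            # consensus confidence floor (assumed)
    neg_spread: float = 0.5,                            # disagreement gap threshold (assumed)
) -> List[Candidate]:
    """Route a clip to human verification when the ensemble either agrees
    confidently (candidate positive) or disagrees strongly (hard negative)."""
    out: List[Candidate] = []
    for clip in clips:
        scores = [clf(clip, clip["label"]) for clf in classifiers]
        lo, hi = min(scores), max(scores)
        if lo >= pos_thresh:            # every classifier is confident -> consensus
            kind = "consensus"
        elif hi - lo >= neg_spread:     # classifiers conflict -> disagreement
            kind = "disagreement"
        else:
            continue                    # neither case is informative; skip this clip
        out.append(Candidate(clip["video_id"], clip["start"], clip["end"],
                             clip["label"], kind))
    return out

# Toy usage: two classifiers that agree on an "archery" clip, so it is
# proposed as a consensus (candidate positive) clip.
cands = mine_candidates(
    [{"video_id": "v1", "start": 5.0, "end": 7.0, "label": "archery"}],
    classifiers=[lambda clip, lbl: 0.90, lambda clip, lbl: 0.85],
)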
Pages: 8667-8677
Page count: 11