Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100

Cited by: 215
Authors
Damen, Dima [1 ]
Doughty, Hazel [1 ,3 ]
Farinella, Giovanni Maria [2 ]
Furnari, Antonino [2 ]
Kazakos, Evangelos [1 ]
Ma, Jian [1 ]
Moltisanti, Davide [1 ,4 ]
Munro, Jonathan [1 ]
Perrett, Toby [1 ]
Price, Will [1 ]
Wray, Michael [1 ]
Affiliations
[1] Univ Bristol, Bristol, Avon, England
[2] Univ Catania, Catania, Italy
[3] Univ Amsterdam, Amsterdam, Netherlands
[4] Nanyang Technol Univ, Singapore, Singapore
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
Video dataset; Egocentric vision; First-person vision; Action understanding; Multi-benchmark large-scale dataset; Annotation quality; Domain adaptation
DOI
10.1007/s11263-021-01531-2
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
This paper introduces the pipeline used to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, and 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments with head-mounted cameras. Compared to its previous version (Damen et al., Scaling Egocentric Vision, ECCV 2018), EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete annotation of fine-grained actions (+128% more action segments). This collection enables new challenges such as action detection and evaluating the "test of time", i.e. whether models trained on data collected in 2018 can generalise to new footage collected two years later. The dataset is aligned with 6 challenges: action recognition (full and weak supervision), action detection, action anticipation, cross-modal retrieval (from captions), and unsupervised domain adaptation for action recognition. For each challenge, we define the task and provide baselines and evaluation metrics.
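The headline statistics in the abstract (100 hours, 20M frames, 90K action segments) imply some simple derived figures. A minimal back-of-envelope sketch, using only the numbers stated above; the derived averages are assumptions (segments may overlap, so the mean segment length is only an approximation):

```python
# Back-of-envelope check on the EPIC-KITCHENS-100 headline statistics
# reported in the abstract: 100 hours, ~20M frames, ~90K action segments.
HOURS = 100
ACTIONS = 90_000
FRAMES = 20_000_000

minutes = HOURS * 60
actions_per_minute = ACTIONS / minutes            # annotation density
avg_fps = FRAMES / (HOURS * 3600)                 # implied average frame rate
mean_action_seconds = (HOURS * 3600) / ACTIONS    # rough mean time per action
                                                  # (approximate: segments can overlap)

print(actions_per_minute)             # 15.0 actions per minute
print(round(avg_fps, 1))              # ~55.6 fps on average
print(round(mean_action_seconds, 1))  # ~4.0 s per action
```

The ~15 actions per minute is the density the abstract's "54% more actions per minute" claim refers to, relative to the original EPIC-KITCHENS annotation pipeline.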
Pages: 33-55
Page count: 23
References (129 in total)
[1] [Anonymous] (2016). HUMAN ACTION LOCALIZ
[2] [Anonymous] (2008). Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) Database.
[3] Bearman, A., Russakovsky, O., Ferrari, V., & Fei-Fei, L. (2016). What's the Point: Semantic Segmentation with Point Supervision. Computer Vision - ECCV 2016, Pt. VII, Lect. Notes Comput. Sci. 9911, 549-565.
[4] Bhattacharyya, A. (2019). ICLR.
[5] Bojanowski, P. (2014). Lect. Notes Comput. Sci. 8693, 628. DOI: 10.1007/978-3-319-10602-1_41.
[6] Heilbron, F. C. (2015). Proc. CVPR (IEEE), 961. DOI: 10.1109/CVPR.2015.7298698.
[7] Caesar, H. P IEEE CVF C COMP VI
[8] Cao, Y. (2017). BMVC.
[9] Caputo, B. (2014). Lect. Notes Comput. Sci., 192.
[10] Carlevaris-Bianco, N., Ushani, A. K., & Eustice, R. M. (2016). University of Michigan North Campus long-term vision and lidar dataset. International Journal of Robotics Research, 35(9), 1023-1035.