Masked Video and Body-Worn IMU Autoencoder for Egocentric Action Recognition

Cited by: 0
Authors
Zhang, Mingfang [1 ]
Huang, Yifei [1 ]
Liu, Ruicong [1 ]
Sato, Yoichi [1 ]
Affiliations
[1] Univ Tokyo, Inst Ind Sci, Tokyo, Japan
Source
COMPUTER VISION - ECCV 2024, PT XVIII | 2025 / Vol. 15076
Keywords
Egocentric action recognition; Inertial Measurement Units; Multimodal Masked Autoencoder; VIEW;
DOI
10.1007/978-3-031-72649-1_18
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs capture accurate motion signals while remaining robust to lighting variation and occlusion. Although these characteristics are intuitively valuable for egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel action recognition method that integrates motion data from body-worn IMUs with egocentric video. Because labeled multimodal data are scarce, we design an MAE-based self-supervised pretraining method that obtains strong multimodal representations by modeling the natural correlation between visual and motion signals. To model the complex relations among multiple IMU devices placed across the body, we exploit their collaborative dynamics and propose to embed the relative motion features of human joints into a graph structure. Experiments show our method achieves state-of-the-art performance on multiple public datasets. The effectiveness of our MAE-based pretraining and graph-based IMU modeling is further validated in more challenging scenarios, including partially missing IMU devices and video quality corruption, enabling more flexible real-world use.
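The two core ingredients the abstract names, MAE-style random masking of video and IMU tokens and a graph over IMU devices whose edges carry relative motion features of joints, can be sketched as follows. This is a toy illustration with hypothetical helper names and toy tensor shapes, not the authors' implementation; the encoder, decoder, and reconstruction loss are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(tokens, mask_ratio, rng):
    """MAE-style masking: keep a random subset of tokens, mask the rest.

    Returns the visible tokens and a boolean mask (True = masked)."""
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False  # False = visible to the encoder
    return tokens[keep_idx], mask

def relative_motion_edges(imu_feats):
    """Edge features for a fully connected IMU graph:
    edge[i, j] = motion of device i relative to device j."""
    return imu_feats[:, None, :] - imu_feats[None, :, :]

# Toy inputs: 16 video patch tokens and 4 body-worn IMU devices, dim 8.
video_tokens = rng.normal(size=(16, 8))
imu_tokens = rng.normal(size=(4, 8))

visible_video, video_mask = random_mask(video_tokens, mask_ratio=0.75, rng=rng)
visible_imu, imu_mask = random_mask(imu_tokens, mask_ratio=0.5, rng=rng)
edges = relative_motion_edges(imu_tokens)

print(visible_video.shape)  # (4, 8): 25% of 16 tokens stay visible
print(int(video_mask.sum()))  # 12 masked video tokens to reconstruct
print(edges.shape)  # (4, 4, 8): pairwise relative features
```

In an actual MAE pipeline, only the visible tokens of both modalities would be encoded jointly, and a lightweight decoder would reconstruct the masked ones, so that each modality learns to predict the other's missing content.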
Pages: 312-330 (19 pages)