Delving Deep Into One-Shot Skeleton-Based Action Recognition With Diverse Occlusions

Cited by: 23
Authors
Peng, Kunyu [1 ]
Roitberg, Alina [1 ]
Yang, Kailun [1 ]
Zhang, Jiaming [1 ]
Stiefelhagen, Rainer [1 ]
Affiliations
[1] Karlsruhe Inst Technol, Inst Anthropomat & Robot, D-76131 Karlsruhe, Germany
Keywords
Transformers; Three-dimensional displays; Task analysis; Benchmark testing; Joints; Prototypes; Image recognition; Computer vision; human activity recognition; representation learning
DOI
10.1109/TMM.2023.3235300
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Occlusions are universal disruptions constantly present in the real world. Especially for sparse representations such as human skeletons, a few occluded points can destroy the geometrical and temporal continuity, critically affecting the results. Yet, research on data-scarce recognition from skeleton sequences, such as one-shot action recognition, does not explicitly consider occlusions despite their everyday pervasiveness. In this work, we explicitly tackle body occlusions for Skeleton-based One-shot Action Recognition (SOAR). We mainly consider two occlusion variants: 1) random occlusions and 2) more realistic occlusions caused by diverse everyday objects, which we generate by projecting existing IKEA 3D furniture models into the camera coordinate system of the 3D skeletons with different geometric parameters (e.g., rotation and displacement). We leverage the proposed pipeline to blend out portions of skeleton sequences from three popular action recognition datasets (NTU-120, NTU-60, and Toyota Smarthome) and formalize the first benchmark for SOAR from partially occluded body poses. This is the first benchmark that considers occlusions for data-scarce action recognition. Another key property of our benchmark is the more realistic occlusions generated by everyday objects, as even in standard recognition from 3D skeletons only randomly missing joints have been considered. We re-evaluate existing state-of-the-art frameworks for SOAR in light of this new task and further introduce Trans4SOAR, a new transformer-based model which leverages three data streams and a mixed attention fusion mechanism to alleviate the adverse effects caused by occlusions. While our experiments demonstrate a clear decline in accuracy with missing skeleton portions, this effect is smaller with Trans4SOAR, which outperforms other architectures on all datasets.
Although we specifically focus on occlusions, Trans4SOAR additionally yields state-of-the-art results in standard SOAR without occlusions, surpassing the best published approach by 2.85% on NTU-120.
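The two occlusion variants described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function names, the 25-joint NTU-style layout, the zero-filling convention, and the axis-aligned box standing in for a projected IKEA furniture model are all assumptions made for the sake of the example.

```python
import numpy as np


def random_occlusion(skeleton, ratio=0.2, rng=None):
    """Variant 1: randomly blend out joints in a skeleton sequence.

    skeleton: array of shape (T, J, 3) with 3D joint coordinates
    ratio:    approximate fraction of (frame, joint) entries to occlude
    Returns the occluded copy and the boolean occlusion mask (T, J).
    """
    rng = np.random.default_rng(rng)
    T, J, _ = skeleton.shape
    mask = rng.random((T, J)) < ratio
    occluded = skeleton.copy()
    occluded[mask] = 0.0  # occluded joints are zero-filled (a convention assumed here)
    return occluded, mask


def object_occlusion(skeleton, box_min, box_max):
    """Variant 2: blend out joints falling inside a 3D object volume.

    Here an axis-aligned box in the skeleton's camera coordinate system
    stands in for a rotated/displaced furniture model; joints whose 3D
    position lies inside the box are treated as occluded by the object.
    """
    inside = np.all((skeleton >= box_min) & (skeleton <= box_max), axis=-1)
    occluded = skeleton.copy()
    occluded[inside] = 0.0
    return occluded, inside
```

A real pipeline would replace the box with the actual projected furniture mesh and check visibility along the camera ray, but the masking step on the skeleton tensor has the same shape as above.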
Pages: 1489-1504 (16 pages)
Related Papers
64 in total
[1]   2D Pose-Based Real-Time Human Action Recognition With Occlusion-Handling [J].
Angelini, Federico ;
Fu, Zeyu ;
Long, Yang ;
Shao, Ling ;
Naqvi, Syed Mohsen .
IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (06) :1433-1446
[2]   ViViT: A Video Vision Transformer [J].
Arnab, Anurag ;
Dehghani, Mostafa ;
Heigold, Georg ;
Sun, Chen ;
Lucic, Mario ;
Schmid, Cordelia .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :6816-6826
[3]  
Bai Ruwen, 2022, Proceedings of the IEEE International Conference on Multimedia and Expo, P01, DOI 10.1109/ICME52920.2022.9859781
[4]   Easy-Ensemble Augmented-Shot-Y-Shaped Learning: State-of-the-Art Few-Shot Classification with Simple Components [J].
Bendou, Yassir ;
Hu, Yuqing ;
Lafargue, Raphael ;
Lioi, Giulia ;
Pasdeloup, Bastien ;
Pateux, Stephane ;
Gripon, Vincent .
JOURNAL OF IMAGING, 2022, 8 (07)
[5]   Few-shot action recognition with implicit temporal alignment and pair similarity optimization [J].
Cao, Congqi ;
Li, Yajuan ;
Lv, Qinyi ;
Wang, Peng ;
Zhang, Yanning .
COMPUTER VISION AND IMAGE UNDERSTANDING, 2021, 210
[6]   Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition [J].
Chen, Yuxin ;
Zhang, Ziqi ;
Yuan, Chunfeng ;
Li, Bing ;
Deng, Ying ;
Hu, Weiming .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :13339-13348
[7]  
Cheng Y.-B., 2021, Proceedings of the 2nd ACM International Conference on Multimedia in Asia, P1
[8]  
Chu XX, 2021, Advances in Neural Information Processing Systems
[9]   MixFormer: End-to-End Tracking with Iterative Mixed Attention [J].
Cui, Yutao ;
Jiang, Cheng ;
Wang, Limin ;
Wu, Gangshan .
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, :13598-13608
[10]   Toyota Smarthome: Real-World Activities of Daily Living [J].
Das, Srijan ;
Dai, Rui ;
Koperski, Michal ;
Minciullo, Luca ;
Garattoni, Lorenzo ;
Bremond, Francois ;
Francesca, Gianpiero .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :833-842