Self-Supervised 3-D Action Recognition by Contrasting Context-Enhanced Action Embeddings

Cited: 0
Authors
Ye, Kenan [1 ,2 ]
Zhao, Brian Nlong [3 ]
Liang, Shuang [1 ,2 ]
Yao, Han [1 ,2 ]
Jia, Wenzhen [1 ,2 ]
Affiliations
[1] Tongji Univ, Sch Comp Sci & Technol, Shanghai 200092, Peoples R China
[2] Engn Res Ctr, Minist Educ, Key Software Technol Smart City Percept & Planning, Shanghai 201804, Peoples R China
[3] Univ Southern Calif, Viterbi Sch Engn, Los Angeles, CA 90089 USA
Source
IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS | 2025
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Skeleton; Joints; Three-dimensional displays; Videos; Semantics; Contrastive learning; Symbols; Kernel; Encoding; Attention mechanisms; self-supervised learning; skeleton-based action recognition;
DOI
10.1109/TCSS.2024.3525083
CLC classification number
TP3 [Computing Technology, Computer Technology];
Discipline code
0812;
Abstract
3-D action recognition has become a fast-growing field in recent years, but traditional approaches have limitations: they either model overly detailed yet redundant information by reconstructing the coordinates of each body joint, or they treat actions as a whole, overlooking the spatial and temporal variations in the semantic locality of actions. To address these limitations, we propose representing long-term actions as contexts of short-term actions organized by locality-aware graphs. Our framework is built on the observation that the continuity of motion and pose variations produces higher correlations among spatio-temporally adjacent joints. Building on this, we encode short-term actions as embeddings using spatio-temporal graph convolutions. This graph-based encoding not only captures richer high-level semantics but also maintains an awareness of the topology. To capture long-term action dynamics effectively, we integrate a graph convolutional gated recurrent unit (GraphGRU) to fuse the action embeddings. Additionally, we introduce a context-aware topological attention (CTA) mechanism: positioned between the embedding-encoding and context-aggregation phases, CTA amplifies the features of context-relevant nodes. Lastly, we create self-supervision by contrasting predicted embeddings with the actually encoded embeddings, which explicitly learns changes in dynamics to obtain distinct embeddings. Empirical evaluations demonstrate that our approach outperforms mainstream unsupervised 3-D action recognition methods.
Pages: 15
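The abstract describes a pipeline of short-term embedding encoding via spatio-temporal graph convolutions, context-aware topological attention, GraphGRU-based context aggregation, and a contrastive objective between predicted and encoded embeddings. The following is a minimal, speculative PyTorch sketch of such a pipeline, written only to illustrate the described structure: all module names, tensor shapes, and hyperparameters are assumptions, a plain nn.GRU stands in for the paper's GraphGRU, and nothing here is the authors' implementation.

# Speculative sketch of the pipeline described in the abstract (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class STGraphConv(nn.Module):
    """Spatio-temporal graph convolution: graph conv over joints, temporal conv over frames."""
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        self.register_buffer("adj", adj)            # (J, J) joint adjacency, assumed normalized
        self.spatial = nn.Linear(in_dim, out_dim)   # per-joint feature transform
        self.temporal = nn.Conv1d(out_dim, out_dim, kernel_size=3, padding=1)

    def forward(self, x):                           # x: (B, T, J, C_in)
        x = torch.einsum("ij,btjc->btic", self.adj, self.spatial(x))
        B, T, J, C = x.shape
        x = self.temporal(x.permute(0, 2, 3, 1).reshape(B * J, C, T))
        return x.reshape(B, J, C, T).permute(0, 3, 1, 2)

class ContextTopologicalAttention(nn.Module):
    """Re-weights joint (node) features, amplifying those most relevant to the context."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                           # x: (B, T, J, C)
        w = torch.softmax(self.score(x), dim=2)     # attention over joints
        return x * w * x.size(2)                    # rescale by J to keep feature magnitude

class ContrastiveContextModel(nn.Module):
    def __init__(self, adj, in_dim=3, dim=128):
        super().__init__()
        self.encoder = STGraphConv(in_dim, dim, adj)           # short-term action embeddings
        self.cta = ContextTopologicalAttention(dim)
        self.aggregator = nn.GRU(dim, dim, batch_first=True)   # stand-in for GraphGRU
        self.predictor = nn.Linear(dim, dim)

    def forward(self, clips):                       # clips: (B, T, J, 3) skeleton sequences
        z = self.cta(self.encoder(clips)).mean(dim=2)   # (B, T, dim), pooled over joints
        ctx, _ = self.aggregator(z[:, :-1])             # aggregate context from past embeddings
        pred = self.predictor(ctx[:, -1])               # predicted next embedding
        target = z[:, -1]                               # actually encoded embedding
        return pred, target

def info_nce(pred, target, tau=0.1):
    """Contrast each predicted embedding against the encoded embeddings in the batch."""
    logits = F.normalize(pred, dim=1) @ F.normalize(target, dim=1).T / tau
    return F.cross_entropy(logits, torch.arange(pred.size(0), device=pred.device))

Under the same assumptions, ContrastiveContextModel(adj=torch.eye(25)) applied to a random batch torch.randn(8, 20, 25, 3) (eight clips of 20 frames and 25 joints) returns one predicted and one encoded embedding per clip, and info_nce(pred, target) contrasts them across the batch, mirroring the self-supervision described in the abstract.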