Temporal cues enhanced multimodal learning for action recognition in RGB-D videos

Cited by: 5
Authors
Liu, Dan [1 ,3 ,4 ]
Meng, Fanrong [1 ]
Xia, Qing [1 ]
Ma, Zhiyuan [1 ]
Mi, Jinpeng [1 ,4 ]
Gan, Yan [2 ]
Ye, Mao [3 ]
Zhang, Jianwei [4 ]
Affiliations
[1] Univ Shanghai Sci & Technol, Inst Machine Intelligence IMI, Shanghai 200093, Peoples R China
[2] Chongqing Univ, Coll Comp Sci, Chongqing 400044, Peoples R China
[3] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[4] Univ Hamburg, Tech Aspects Multimodal Syst TAMS Grp, D-22527 Hamburg, Germany
Funding
National Science Foundation (USA); National Natural Science Foundation of China;
Keywords
Human action recognition; Multimodal learning; Co-learning; Temporal modeling; NETWORK;
DOI
10.1016/j.neucom.2024.127882
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Action recognition is an important and active research direction in computer vision, where temporal modeling is critical for action representation. Generally, unimodal methods that use only RGB or skeleton modality for human action recognition have their limitations, e.g., information redundancy/environment noise of RGB video modality, and spatial interaction deficiency of skeleton modality. In this paper, we present a novel multimodal learning approach based on RGB and skeleton modalities for action recognition in RGB-D videos. Specifically, we (1) transfer skeleton knowledge to RGB video for effective video compression, which produces the informative action image from raw RGB video, (2) introduce the temporal cues enhancement module to adequately learn the spatiotemporal representation for action classification, and (3) propose a multi-level multimodal co-learning framework for human action recognition in RGB-D videos. Experimental results on NTU RGB+D, PKU-MMD, and N-UCLA datasets demonstrate the effectiveness of the proposed multimodal learning method.
Pages: 10