SlowFast Multimodality Compensation Fusion Swin Transformer Networks for RGB-D Action Recognition

Cited by: 4

Authors:

Xiao, Xiongjiang ^{[1]}

Ren, Ziliang ^{[1]}

Li, Huan ^{[1]}

Wei, Wenhong ^{[1]}

Yang, Zhiyong ^{[2]}

Yang, Huaide ^{[3]}

Affiliations:

[1] Dongguan Univ Technol, Sch Comp Sci & Technol, Dongguan 523820, Peoples R China

[2] Yantai Inst Technol, Sch Artificial Intelligence, Yantai 264003, Peoples R China

[3] Dongguan Polytech, Sch Elect Informat, Dongguan 523109, Peoples R China

Source:

MATHEMATICS | 2023 / Vol. 11 / No. 09

Funding:

National Natural Science Foundation of China;

Keywords:

action recognition; multimodality compensation; SlowFast pathways; swin transformer; dual-stream; NEURAL-NETWORKS; REPRESENTATION;

DOI:

10.3390/math11092115

Chinese Library Classification (CLC) number:

O1 [Mathematics];

Subject classification codes:

0701 ; 070101 ;

Abstract:

RGB-D-based technology combines the advantages of RGB and depth sequences, which allows human actions to be recognized effectively in different environments. However, it is difficult for the different modalities to learn spatio-temporal information from each other. To enhance the information exchange between modalities, we introduce a SlowFast multimodality compensation block (SFMCB) designed to extract compensation features. Concretely, the SFMCB fuses features from two independent pathways with different frame rates into a single convolutional neural network to achieve performance gains for the model. Furthermore, we explore two fusion schemes for combining the features from the two pathways. To facilitate the learning of features from the multiple independent pathways, multiple loss functions are used for joint optimization. To evaluate the proposed architecture, we conducted experiments on four challenging datasets: NTU RGB+D 60, NTU RGB+D 120, THU-READ, and PKU-MMD. The experimental results demonstrate the effectiveness of the proposed model, which uses the SFMCB mechanism to capture complementary features of multimodal inputs.
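The abstract names three ingredients without giving details: two pathways with different frame rates, a fusion step, and joint optimization over several losses. The toy sketch below illustrates the general idea only. Everything in it is an assumption for illustration and not taken from the paper: the strides, the average pooling used to align the fast pathway with the slow one, the two fusion operations (element-wise sum and pairing as a stand-in for concatenation), the loss weights, and all function names. Per-frame "features" are plain floats rather than network activations.

```python
def sample_indices(num_frames, stride):
    """Pick every `stride`-th frame; a larger stride means a lower frame rate."""
    return list(range(0, num_frames, stride))


def temporal_pool(features, factor):
    """Average consecutive groups of `factor` features (hypothetical alignment step)."""
    pooled = []
    for start in range(0, len(features), factor):
        group = features[start:start + factor]
        pooled.append(sum(group) / len(group))
    return pooled


def fuse_sum(slow, fast_pooled):
    """Assumed fusion scheme A: element-wise sum of aligned features."""
    return [s + f for s, f in zip(slow, fast_pooled)]


def fuse_concat(slow, fast_pooled):
    """Assumed fusion scheme B: keep both features side by side per time step."""
    return [(s, f) for s, f in zip(slow, fast_pooled)]


def joint_loss(losses, weights=None):
    """Weighted sum of per-pathway losses; equal weights unless given (an assumption)."""
    if weights is None:
        weights = [1.0] * len(losses)
    return sum(w * l for w, l in zip(weights, losses))


# Toy run: 32 frames, slow pathway stride 8, fast pathway stride 2 (assumed values).
slow_idx = sample_indices(32, 8)
fast_idx = sample_indices(32, 2)

# Stand-in features: the frame index itself, as a float.
slow_feat = [float(i) for i in slow_idx]
fast_feat = [float(i) for i in fast_idx]

# The fast pathway has 4x as many time steps, so pool by 4 before fusing.
fast_aligned = temporal_pool(fast_feat, len(fast_idx) // len(slow_idx))

fused_sum = fuse_sum(slow_feat, fast_aligned)
fused_cat = fuse_concat(slow_feat, fast_aligned)
total = joint_loss([0.5, 0.25, 0.25])
```

In a real model the two pathways would be network branches and the fusion would operate on feature tensors; the sketch only shows how differing frame rates force a temporal alignment step before any fusion.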


Pages: 19

