B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Cited by: 13

Authors:

Guo, Fangtai ^{[1]}

Jin, Tianlei ^{[1]}

Zhu, Shiqiang ^{[1]}

Xi, Xiangming ^{[1]}

Wang, Wen ^{[1]}

Meng, Qiwei ^{[1]}

Song, Wei ^{[1]}

Zhu, Jiakai ^{[1]}

Affiliations:

[1] Zhejiang Lab, Res Ctr Intelligent Robot, Hangzhou 311121, Peoples R China

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2023 / Vol. 32

Funding:

National Natural Science Foundation of China;

Keywords:

Human action recognition; homogeneous modalities; fusion model; limb flow fields; B2C-AFM;

DOI:

10.1109/TIP.2023.3308750

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Human action recognition is a driving engine of many human-computer interaction applications. Most current research focuses on improving model generalization by integrating multiple homogeneous modalities, including RGB images, human poses, and optical flows. Furthermore, contextual interactions and out-of-context sign languages have been shown to depend on the scene category and on the human per se. Attempts to integrate appearance features and human poses have shown positive results. However, because of the spatial errors and temporal ambiguities of human poses, existing methods suffer from poor scalability, limited robustness, and sub-optimal models. In this paper, inspired by the assumption that different modalities may maintain temporal consistency and spatial complementarity, we present a novel Bi-directional Co-temporal and Cross-spatial Attention Fusion Model (B2C-AFM). Our model is characterized by an asynchronous fusion strategy for multi-modal features along the temporal and spatial dimensions. In addition, novel explicit motion-oriented pose representations called Limb Flow Fields (Lff) are explored to alleviate the temporal ambiguity of human poses. Experiments on publicly available datasets validate our contributions. Extensive ablation studies show that B2C-AFM achieves robust performance across seen and unseen human actions. The code is available at https://github.com/gftww/B2C.git.


Pages: 4989-5003

Page count: 15

