Multimodal feature fusion for human activity recognition using human centric temporal transformer

Cited by: 0
Authors
Khan, Samee Ullah [1 ]
Sultana, Maryam [2 ]
Danish, Sufyan [3 ]
Gupta, Deepak [4 ,5 ]
Alghamdi, Norah Saleh [6 ]
Woo, Suchang [7 ]
Lee, Dong-Gyu [8 ]
Ahn, Sangtae [1 ]
Affiliations
[1] Kyungpook Natl Univ, Sch Elect & Elect Engn, Daegu 41566, South Korea
[2] Oxford Brookes Univ, Oxford, England
[3] Sejong Univ, Dept Comp Sci & Engn, Seoul 05006, South Korea
[4] Maharaja Agrasen Inst Technol Delhi, Rajpura, Punjab, India
[5] Chitkara Univ, Ctr Res Impact & Outcome, Rajpura 140401, Punjab, India
[6] Princess Nourah bint Abdulrahman Univ, Coll Comp & Informat Sci, Dept Comp Sci, POB 84428, Riyadh 11671, Saudi Arabia
[7] LG Elect Inc, Changwon Factory 2, Air Care Business Div, Air Solut Business, H&A Business Headquarters, Chang Won 51453, South Korea
[8] Kyungpook Natl Univ, Dept Artificial Intelligence, Daegu 41566, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Vision transformer; Multi modality; Human activity recognition; Surveillance data; Attention module; Human centric transformer; Implemented artificial intelligence; Application of artificial intelligence; Urban safety; LSTM; INTERNET; FLOW; CNN;
DOI
10.1016/j.engappai.2025.111844
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
In recent years, human activity recognition (HAR) has attracted considerable interest owing to its wide range of monitoring applications. Mainstream HAR approaches often struggle to produce reliable results when relying on a single data modality, particularly when heterogeneous data sources must be integrated. A notable limitation of deployed artificial intelligence (AI) models is their restricted ability to handle dynamic scenarios: lacking contextual information from multiple sources, they lose adaptability and accuracy. This paper proposes a multimodal framework for HAR that fuses human-centric patterns using several spatiotemporal model variants. To extract spatial features, a Swin transformer with a dual-attention mechanism processes visual sensor data, while a one-dimensional convolutional neural network encodes human skeleton information obtained from a detection model with numerous key points. These multimodal features are then fused to support robust analysis and comprehension of activities. The resulting features are passed to the human-centric temporal transformer (HCTT), which processes multimodal sequence data for temporal learning. Moreover, the attention block of the HCTT captures human-related attentive patterns, followed by a dual fusion mechanism. The proposed model was evaluated on four open-access large-scale HAR datasets, where comprehensive ablation studies and comparative analyses demonstrated that the developed multimodal approach outperforms recent baseline HAR models, underscoring its potential for advancing AI applications and human activity analysis.
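To make the described architecture concrete, the following is a minimal PyTorch sketch of the two-branch pipeline outlined in the abstract. It is an illustrative reconstruction under stated assumptions, not the authors' implementation: a small convolutional encoder stands in for the dual-attention Swin transformer, a plain transformer encoder stands in for the HCTT's human-centric attention block, and all names and dimensions (num_keypoints, d_model, clip length) are assumed for illustration.

# Minimal sketch of the described two-branch HAR pipeline (NOT the authors'
# code): a stand-in spatial encoder replaces the dual-attention Swin
# transformer, a 1D CNN encodes skeleton key points, features are fused by
# concatenation, and a transformer encoder models temporal dependencies.
import torch
import torch.nn as nn

class MultimodalHAR(nn.Module):
    def __init__(self, num_classes=60, num_keypoints=17, d_model=256):
        super().__init__()
        # Visual branch: per-frame spatial features. A small CNN stands in
        # for the paper's dual-attention Swin transformer.
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, d_model),
        )
        # Skeleton branch: 1D convolutions along the frame axis, with the
        # (x, y) coordinates of each key point treated as input channels.
        self.skeleton = nn.Sequential(
            nn.Conv1d(num_keypoints * 2, 128, 3, padding=1), nn.ReLU(),
            nn.Conv1d(128, d_model, 3, padding=1), nn.ReLU(),
        )
        # Fusion of the two per-frame feature streams.
        self.fuse = nn.Linear(2 * d_model, d_model)
        # Temporal transformer over the fused frame sequence (a plain
        # encoder here; the paper's HCTT adds human-centric attention).
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, frames, joints):
        # frames: (B, T, 3, H, W) RGB clip; joints: (B, T, K*2) key points.
        B, T = frames.shape[:2]
        vis = self.spatial(frames.flatten(0, 1)).view(B, T, -1)   # (B, T, d)
        skel = self.skeleton(joints.transpose(1, 2)).transpose(1, 2)  # (B, T, d)
        fused = self.fuse(torch.cat([vis, skel], dim=-1))         # (B, T, d)
        out = self.temporal(fused).mean(dim=1)  # temporal pooling
        return self.head(out)                   # activity logits

model = MultimodalHAR()
logits = model(torch.randn(2, 16, 3, 112, 112), torch.randn(2, 16, 34))
print(logits.shape)  # torch.Size([2, 60])

The concatenation-plus-linear fusion used here is only the simplest stand-in for the paper's dual fusion mechanism, and mean pooling over time replaces whatever temporal aggregation the HCTT applies before classification.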
Pages: 15