SMTDKD: A Semantic-Aware Multimodal Transformer Fusion Decoupled Knowledge Distillation Method for Action Recognition

Times Cited: 3
Authors
Quan, Zhenzhen [1 ]
Chen, Qingshan [1 ]
Wang, Wei [1 ]
Zhang, Moyan [1 ]
Li, Xiang [1 ]
Li, Yujun [1 ]
Liu, Zhi [1 ]
Affiliations
[1] Shandong Univ, Sch Informat Sci & Engn, Qingdao 266237, Shandong, Peoples R China
Keywords
Transformers; Sensors; Feature extraction; Wearable sensors; Visualization; Semantics; Knowledge engineering; Decoupled knowledge distillation; human action recognition (HAR); multimodal; transformer; wearable sensor; VISION
DOI
10.1109/JSEN.2023.3337367
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline Classification Code
0808; 0809
Abstract
Multimodal sensors, including vision sensors and wearable sensors, offer valuable complementary information for accurate recognition tasks. Nonetheless, the heterogeneity among sensor data from different modalities presents a formidable challenge in extracting robust multimodal information amidst noise. In this article, we propose an innovative approach, the semantic-aware multimodal transformer fusion decoupled knowledge distillation (SMTDKD) method, which guides video data recognition not only through the information interaction between different wearable-sensor data, but also through the information interaction between visual sensor data and wearable-sensor data, improving the robustness of the model. To preserve the temporal relationship within wearable-sensor data, the SMTDKD method converts them into 2-D image data. Furthermore, a transformer-based multimodal fusion module is designed to capture diverse feature information from distinct wearable-sensor modalities. To mitigate modality discrepancies and encourage similar semantic features, graph cross-view attention maps are constructed across various convolutional layers to facilitate feature alignment. Additionally, semantic information is exchanged among the teacher-student network, the student network, and bidirectional encoder representations from transformers (BERT)-encoded labels. To obtain more comprehensive knowledge transfer, the decoupled knowledge distillation loss is utilized, thus enhancing the generalization of the network. Experimental evaluations conducted on three multimodal datasets, namely, UTD-MHAD, Berkeley-MHAD, and MMAct, demonstrate the superior performance of the proposed SMTDKD method over state-of-the-art human action recognition methods.
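The decoupled knowledge distillation loss referred to in the abstract splits the conventional KD objective into a target-class term (TCKD) and a non-target-class term (NCKD) that can be weighted independently. The following PyTorch sketch illustrates one common formulation of such a loss; it is not the paper's implementation, and the function name dkd_loss and the hyperparameters alpha, beta, and temperature are illustrative assumptions rather than values taken from the article.

# Minimal sketch of a decoupled knowledge distillation (DKD) loss, assuming
# classification logits from a teacher and a student network. Hyperparameter
# defaults (alpha, beta, temperature) are illustrative only.
import torch
import torch.nn.functional as F


def dkd_loss(student_logits: torch.Tensor,
             teacher_logits: torch.Tensor,
             target: torch.Tensor,
             alpha: float = 1.0,
             beta: float = 8.0,
             temperature: float = 4.0) -> torch.Tensor:
    num_classes = student_logits.size(1)
    gt_mask = F.one_hot(target, num_classes).bool()  # 1 at the ground-truth class

    # TCKD: KL divergence over the binary (target vs. non-target) probability mass.
    p_s = F.softmax(student_logits / temperature, dim=1)
    p_t = F.softmax(teacher_logits / temperature, dim=1)
    s_bin = torch.stack([(p_s * gt_mask).sum(1), (p_s * ~gt_mask).sum(1)], dim=1)
    t_bin = torch.stack([(p_t * gt_mask).sum(1), (p_t * ~gt_mask).sum(1)], dim=1)
    tckd = F.kl_div(torch.log(s_bin + 1e-8), t_bin, reduction="batchmean") * temperature ** 2

    # NCKD: KL divergence over the non-target classes only. Subtracting a large
    # value at the ground-truth position removes that class from the softmax.
    s_nt = F.log_softmax(student_logits / temperature - 1000.0 * gt_mask, dim=1)
    t_nt = F.softmax(teacher_logits / temperature - 1000.0 * gt_mask, dim=1)
    nckd = F.kl_div(s_nt, t_nt, reduction="batchmean") * temperature ** 2

    return alpha * tckd + beta * nckd

In a teacher-student setup, a term like this would typically be added to a standard cross-entropy loss on the ground-truth labels; weighting TCKD and NCKD separately is what distinguishes the decoupled formulation from classic logit distillation.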
Pages: 2289-2304
Number of pages: 16