Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition

Cited by: 112
Authors
Boulahia, Said Yacine [1 ]
Amamra, Abdenour [1 ]
Madi, Mohamed Ridha [1 ]
Daikh, Said [1 ]
Affiliations
[1] Ecole Mil Polytech, BP 17, Algiers 16111, Algeria
Keywords
Action recognition; Early fusion; Intermediate fusion; Late fusion; Deep learning; RGB-D; SKELETON;
DOI
10.1007/s00138-021-01249-8
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal action recognition techniques combine several image modalities (RGB, depth, skeleton, and infrared) for more robust recognition. According to the level at which fusion occurs in the action recognition pipeline, three families of approaches can be distinguished: early fusion, where the raw modalities are combined before feature extraction; intermediate fusion, where the features extracted from each modality are concatenated before classification; and late fusion, where the modality-wise classification results are combined. After reviewing the literature, we identified the principal shortcomings of each category and address them as follows. First, we investigate more deeply early-stage fusion, which has received little attention in the literature. Second, since intermediate fusion protocols operate on feature maps irrespective of the particularities of human action, we propose a new scheme that optimally combines modality-wise features. Third, as most late fusion solutions rely on handcrafted rules, which are prone to human bias and far removed from real-world peculiarities, we adopt a neural learning strategy that extracts significant features from the data rather than assuming that artificial rules are correct. We validated our findings on two challenging datasets, obtaining results as good as or better than their literature counterparts.
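The three fusion levels named in the abstract can be sketched with toy NumPy stand-ins. Everything here is an illustrative assumption, not the paper's actual models: the "backbone" and "classifier" are fixed random linear maps, and the two modalities are random vectors standing in for RGB and depth descriptors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two modalities (e.g. RGB and depth), one row per sample.
rgb = rng.normal(size=(4, 32))
depth = rng.normal(size=(4, 32))
n_classes = 5

def feature_extractor(x, out_dim=16):
    # Hypothetical frozen backbone: a fixed random projection plus ReLU.
    w = np.random.default_rng(x.shape[1]).normal(size=(x.shape[1], out_dim))
    return np.maximum(x @ w, 0.0)

def classifier(feats):
    # Hypothetical linear head followed by a softmax over class scores.
    w = np.random.default_rng(feats.shape[1]).normal(size=(feats.shape[1], n_classes))
    logits = feats @ w
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Early fusion: combine raw modalities before feature extraction.
early = classifier(feature_extractor(np.concatenate([rgb, depth], axis=1)))

# Intermediate fusion: extract features per modality, then concatenate
# before the classifier.
inter = classifier(np.concatenate([feature_extractor(rgb),
                                   feature_extractor(depth)], axis=1))

# Late fusion: classify each modality separately, then combine the
# per-modality class scores (here: a plain average).
late = 0.5 * (classifier(feature_extractor(rgb)) +
              classifier(feature_extractor(depth)))

for probs in (early, inter, late):
    assert probs.shape == (4, n_classes)
    assert np.allclose(probs.sum(axis=1), 1.0)
```

The only structural difference between the three variants is where the concatenation (or averaging) happens relative to the feature extractor and classifier; the paper's contributions replace these fixed combination rules with learned ones.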
Pages: 18