Modality Distillation with Multiple Stream Networks for Action Recognition

Cited by: 114

Authors:

Garcia, Nuno C. [1, 2]

Morerio, Pietro [1]

Murino, Vittorio [1, 3]

Affiliations:

[1] Ist Italiano Tecnol, Genoa, Italy

[2] Univ Genoa, Genoa, Italy

[3] Univ Verona, Verona, Italy

Source:

COMPUTER VISION - ECCV 2018, PT VIII | 2018 / Vol. 11212

Keywords:

Action recognition; Deep multimodal learning; Distillation; Privileged information

DOI:

10.1007/978-3-030-01237-3_7

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Diverse input data modalities can provide complementary cues for several tasks, usually leading to more robust algorithms and better performance. However, while a (training) dataset can be carefully designed to include a variety of sensory inputs, it is often the case that not all modalities are available in real-life (testing) scenarios, where a model has to be deployed. This raises the challenge of how to learn robust representations that leverage multimodal data in the training stage while accounting for limitations at test time, such as noisy or missing modalities. This paper presents a new approach for multimodal video action recognition, developed within generalized distillation, the framework that unifies distillation and privileged information. In particular, we consider the case of learning representations from depth and RGB videos, while relying on RGB data only at test time. We propose a new approach to train a hallucination network that learns to distill depth features through multiplicative connections of spatiotemporal representations, leveraging soft labels and hard labels, as well as the distance between feature maps. We report state-of-the-art results on video action classification on the largest multimodal dataset available for this task, NTU RGB+D, as well as on UWA3DII and Northwestern-UCLA.
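
The abstract names three training signals for the hallucination network: hard labels, soft labels, and the distance between feature maps. A minimal sketch of how such terms might be combined is given here. It is not the authors' formulation: the function names, the temperature, the two weights, and the use of a mean squared distance are all assumptions made for illustration, and the multiplicative connections between streams are omitted.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))


def hallucination_loss(student_logits, teacher_logits, true_class,
                       student_features, teacher_features,
                       temperature=2.0, imitation_weight=0.5,
                       feature_weight=1.0):
    """Hypothetical combined loss; all hyperparameters are assumptions.

    student_* come from the RGB-fed hallucination stream,
    teacher_* from the depth stream used only during training.
    """
    one_hot = [1.0 if i == true_class else 0.0
               for i in range(len(student_logits))]
    # Hard-label term: ordinary classification loss.
    hard = cross_entropy(one_hot, softmax(student_logits))
    # Soft-label term: match the teacher's softened predictions.
    soft = cross_entropy(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # Feature term: mean squared distance between (flattened) feature maps.
    feat = sum((s - t) ** 2
               for s, t in zip(student_features, teacher_features))
    feat /= len(student_features)
    return ((1.0 - imitation_weight) * hard
            + imitation_weight * soft
            + feature_weight * feat)
```

With `imitation_weight=0` and `feature_weight=0` the sketch reduces to plain supervised training on hard labels, which makes the contribution of each additional term easy to inspect in isolation.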

Citation

Pages: 106-121

Page count: 16

