Modality Distillation with Multiple Stream Networks for Action Recognition

Cited by: 114
Authors
Garcia, Nuno C. [1 ,2 ]
Morerio, Pietro [1 ]
Murino, Vittorio [1 ,3 ]
Affiliations
[1] Ist Italiano Tecnol, Genoa, Italy
[2] Univ Genoa, Genoa, Italy
[3] Univ Verona, Verona, Italy
Source
COMPUTER VISION - ECCV 2018, PT VIII | 2018, Vol. 11212
Keywords
Action recognition; Deep multimodal learning; Distillation; Privileged information;
DOI
10.1007/978-3-030-01237-3_7
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Diverse input data modalities can provide complementary cues for several tasks, usually leading to more robust algorithms and better performance. However, while a (training) dataset could be carefully designed to include a variety of sensory inputs, it is often the case that not all modalities are available in real-life (testing) scenarios, where a model has to be deployed. This raises the challenge of how to learn robust representations leveraging multimodal data in the training stage, while considering limitations at test time, such as noisy or missing modalities. This paper presents a new approach for multimodal video action recognition, developed within the unified framework of distillation and privileged information, named generalized distillation. In particular, we consider the case of learning representations from depth and RGB videos, while relying on RGB data only at test time. We propose a new approach to train a hallucination network that learns to distill depth features through multiplicative connections of spatiotemporal representations, leveraging soft labels and hard labels, as well as the distance between feature maps. We report state-of-the-art results on video action classification on the largest multimodal dataset available for this task, NTU RGB+D, as well as on the UWA3DII and Northwestern-UCLA datasets.
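As an illustration of the training objective summarized in the abstract, the following is a minimal sketch (an assumption, not the authors' released code) of a generalized-distillation loss for the hallucination network. It assumes a PyTorch-style setup in which a frozen depth "teacher" stream supplies soft labels and target feature maps for the RGB-driven hallucination stream; the function name, weights alpha and beta, and temperature T are illustrative choices, not values from the paper.

# Minimal sketch (assumed) of a generalized-distillation loss: the hallucination
# stream is trained with hard labels, the depth teacher's soft labels, and a
# distance between feature maps, as described in the abstract.
import torch.nn.functional as F

def generalized_distillation_loss(hall_logits, hall_feats,
                                  depth_logits, depth_feats,
                                  labels, T=4.0, alpha=0.5, beta=1.0):
    # Hard labels: standard cross-entropy on the hallucination stream's predictions.
    hard = F.cross_entropy(hall_logits, labels)
    # Soft labels: KL divergence to the depth teacher's temperature-softened outputs.
    soft = F.kl_div(F.log_softmax(hall_logits / T, dim=1),
                    F.softmax(depth_logits.detach() / T, dim=1),
                    reduction="batchmean") * (T * T)
    # Feature distance: push hallucinated feature maps toward the depth feature maps.
    feat = F.mse_loss(hall_feats, depth_feats.detach())
    return (1.0 - alpha) * hard + alpha * soft + beta * feat

A training loop would add this loss to the usual classification loss of the RGB stream; at test time only the RGB and hallucination streams are evaluated, so no depth input is required.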
Pages: 106-121
Number of pages: 16
References
34 in total
  • [11] Gkioxari G, 2015, PROC CVPR IEEE, P759, DOI 10.1109/CVPR.2015.7298676
  • [12] Han, Bo; Yao, Quanming; Yu, Xingrui; Niu, Gang; Xu, Miao; Hu, Weihua; Tsang, Ivor W.; Sugiyama, Masashi. Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels. Advances in Neural Information Processing Systems 31 (NIPS 2018), 2018.
  • [13] Wang, Heng, 2011, 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), P3169, DOI 10.1109/CVPR.2011.5995407
  • [14] Hoffman, Judy; Gupta, Saurabh; Darrell, Trevor. Learning with Side Information through Modality Hallucination. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 826-834.
  • [15] Ioffe, Sergey, 2015, PROC CVPR IEEE, P448, DOI 10.1109/CVPR.2016.90
  • [16] Karpathy, Andrej; Toderici, George; Shetty, Sanketh; Leung, Thomas; Sukthankar, Rahul; Fei-Fei, Li. Large-scale Video Classification with Convolutional Neural Networks. 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014: 1725-1732.
  • [17] Laptev, Ivan; Marszalek, Marcin; Schmid, Cordelia; Rozenfeld, Benjamin. Learning realistic human actions from movies. 2008 IEEE Conference on Computer Vision and Pattern Recognition, Vols 1-12, 2008: 3222+.
  • [18] Liu, J., 2017, arXiv:1709.05087
  • [19] Lopez-Paz, D., 2016, Proc. Int. Conf. on Learning Representations (ICLR)
  • [20] Luo, Z., 2017, arXiv:1712.00108