Sequential Deep Trajectory Descriptor for Action Recognition With Three-Stream CNN

被引:159
作者
Shi, Yemin [1 ]
Tian, Yonghong [1 ]
Wang, Yaowei [2 ]
Huang, Tiejun [1 ]
机构
[1] Peking Univ, Sch Elect Engn & Comp Sci, Cooperat Medianet Innovat Ctr, Natl Engn Lab Video Technol, Beijing 100871, Peoples R China
[2] Beijing Inst Technol, Sch Informat & Elect, Beijing 100081, Peoples R China
基金
中国国家自然科学基金;
关键词
Action recognition; sequential deep trajectory descriptor (sDTD); three-stream framework; long-term motion;
D O I
10.1109/TMM.2017.2666540
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Learning the spatial-temporal representation of motion information is crucial to human action recognition. Nevertheless, most of the existing features or descriptors cannot capture motion information effectively, especially for long-term motion. To address this problem, this paper proposes a long-term motion descriptor called sequential deep trajectory descriptor (sDTD). Specifically, we project dense trajectories into two-dimensional planes, and subsequently a CNN-RNN network is employed to learn an effective representation for long-term motion. Unlike the popular two-stream ConvNets, the sDTD stream is introduced into a three-stream framework so as to identify actions from a video sequence. Consequently, this three-stream framework can simultaneously capture static spatial features, short-term motion, and long-term motion in the video. Extensive experiments were conducted on three challenging datasets: KTH, HMDB51, and UCF101. Experimental results show that our method achieves state-of-the-art performance on the KTH and UCF101 datasets, and is comparable to the state-of-the-art methods on the HMDB51 dataset.
引用
收藏
页码:1510 / 1520
页数:11
相关论文
共 58 条
  • [11] [Anonymous], 2014, CORR
  • [12] [Anonymous], 2015, 2015 IEEE International Conference on Multimedia and Expo (ICME)
  • [13] [Anonymous], 2015, CORR
  • [14] SURF: Speeded up robust features
    Bay, Herbert
    Tuytelaars, Tinne
    Van Gool, Luc
    [J]. COMPUTER VISION - ECCV 2006 , PT 1, PROCEEDINGS, 2006, 3951 : 404 - 417
  • [15] Bilinski P., 2013, P 2013 10 IEEE INT C, P1
  • [16] Heilbron FC, 2015, PROC CVPR IEEE, P961, DOI 10.1109/CVPR.2015.7298698
  • [17] Multi-View Super Vector for Action Recognition
    Cai, Zhuowei
    Wang, Limin
    Peng, Xiaojiang
    Qiao, Yu
    [J]. 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2014, : 596 - 603
  • [18] Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
    Dahl, George E.
    Yu, Dong
    Deng, Li
    Acero, Alex
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (01): : 30 - 42
  • [19] Histograms of oriented gradients for human detection
    Dalal, N
    Triggs, B
    [J]. 2005 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2005, : 886 - 893
  • [20] Human detection using oriented histograms of flow and appearance
    Dalal, Navneet
    Triggs, Bill
    Schmid, Cordelia
    [J]. COMPUTER VISION - ECCV 2006, PT 2, PROCEEDINGS, 2006, 3952 : 428 - 441