Sequential Deep Trajectory Descriptor for Action Recognition With Three-Stream CNN

Cited by: 159
Authors
Shi, Yemin [1]
Tian, Yonghong [1]
Wang, Yaowei [2]
Huang, Tiejun [1]
Affiliations
[1] Peking Univ, Sch Elect Engn & Comp Sci, Cooperat Medianet Innovat Ctr, Natl Engn Lab Video Technol, Beijing 100871, Peoples R China
[2] Beijing Inst Technol, Sch Informat & Elect, Beijing 100081, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Action recognition; sequential deep trajectory descriptor (sDTD); three-stream framework; long-term motion
DOI
10.1109/TMM.2017.2666540
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Learning the spatial-temporal representation of motion information is crucial to human action recognition. Nevertheless, most of the existing features or descriptors cannot capture motion information effectively, especially for long-term motion. To address this problem, this paper proposes a long-term motion descriptor called sequential deep trajectory descriptor (sDTD). Specifically, we project dense trajectories into two-dimensional planes, and subsequently a CNN-RNN network is employed to learn an effective representation for long-term motion. Unlike the popular two-stream ConvNets, the sDTD stream is introduced into a three-stream framework so as to identify actions from a video sequence. Consequently, this three-stream framework can simultaneously capture static spatial features, short-term motion, and long-term motion in the video. Extensive experiments were conducted on three challenging datasets: KTH, HMDB51, and UCF101. Experimental results show that our method achieves state-of-the-art performance on the KTH and UCF101 datasets, and is comparable to the state-of-the-art methods on the HMDB51 dataset.
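The two ideas the abstract names, projecting trajectories onto two-dimensional planes and combining three streams, can be illustrated with a rough sketch. This is not the authors' implementation: the function names `project_trajectories` and `fuse_scores`, the choice to write each displacement at the trajectory's starting pixel, and the weighted averaging of class scores are all assumptions made for illustration, and the CNN-RNN that would consume the planes is omitted.

```python
import numpy as np


def project_trajectories(trajectories, height, width):
    """Rasterize point trajectories into per-step 2D displacement planes.

    trajectories: array-like of shape (N, T, 2) with (x, y) positions per frame.
    Returns an array of shape (T - 1, height, width, 2). Plane `s` stores the
    (dx, dy) displacement of each trajectory between frames s and s + 1,
    written at the trajectory's starting pixel (an illustrative assumption).
    """
    trajectories = np.asarray(trajectories, dtype=np.float32)
    _, num_frames, _ = trajectories.shape
    planes = np.zeros((num_frames - 1, height, width, 2), dtype=np.float32)

    # Starting pixel of every trajectory, clipped to the image bounds.
    x0 = np.clip(np.rint(trajectories[:, 0, 0]).astype(int), 0, width - 1)
    y0 = np.clip(np.rint(trajectories[:, 0, 1]).astype(int), 0, height - 1)

    # Frame-to-frame displacement of every trajectory: shape (N, T - 1, 2).
    deltas = np.diff(trajectories, axis=1)
    for step in range(num_frames - 1):
        planes[step, y0, x0] = deltas[:, step]
    return planes


def fuse_scores(spatial, temporal, long_term, weights=(1.0, 1.0, 1.0)):
    """Weighted average of per-class scores from three streams (assumed fusion)."""
    scores = np.stack([spatial, temporal, long_term]).astype(np.float32)
    w = np.asarray(weights, dtype=np.float32)[:, None]
    return (w * scores).sum(axis=0) / w.sum()


if __name__ == "__main__":
    # One hypothetical trajectory tracked over three frames in a 4 x 5 image.
    planes = project_trajectories([[[1, 2], [2, 2], [4, 3]]], height=4, width=5)
    print(planes.shape)

    # Hypothetical class scores from the three streams for a two-class problem.
    fused = fuse_scores([0.2, 0.8], [0.6, 0.4], [0.1, 0.9])
    print(int(np.argmax(fused)))
```

In this sketch the stack of planes plays the role of the 2D input sequence that a learned model would process, and the fusion step only shows how scores from spatial, short-term, and long-term streams might be combined; the paper itself should be consulted for the actual projection and fusion schemes.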
Pages: 1510-1520
Page count: 11