D3D: Dual 3-D Convolutional Network for Real-Time Action Recognition

Cited by: 29
Authors
Jiang, Shengqin [1 ,2 ]
Qi, Yuankai [3 ]
Zhang, Haokui [4 ]
Bai, Zongwen [5 ,6 ]
Lu, Xiaobo [1 ,2 ]
Wang, Peng [7 ]
Affiliations
[1] Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China
[2] Minist Educ, Key Lab Measurement & Control Complex Syst Engn, Nanjing 210096, Peoples R China
[3] Harbin Inst Technol, Sch Comp Sci & Technol, Weihai 264209, Peoples R China
[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710129, Peoples R China
[5] Shaanxi Key Lab Intelligent Proc Big Energy Data, Yanan 716000, Peoples R China
[6] Yanan Univ, Sch Phys & Elect Informat, Yanan 716000, Peoples R China
[7] Univ Wollongong, Sch Comp & Informat Technol, Wollongong, NSW 2170, Australia
Funding
National Natural Science Foundation of China;
Keywords
Three-dimensional displays; Feature extraction; Convolution; Two dimensional displays; Streaming media; Kernel; Informatics; Three-dimensional convolutional neural networks (3D CNNs); action recognition; lightweight network; spatio-temporal information;
DOI
10.1109/TII.2020.3018487
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Three-dimensional convolutional neural networks (3D CNNs) have been explored to learn spatio-temporal information for video-based human action recognition. However, the high computational cost and memory demand of standard 3D CNNs hinder their application in practical scenarios. In this article, we address these limitations by proposing a novel dual 3-D convolutional network (D3DNet) with two complementary lightweight branches. A coarse branch maintains a large temporal receptive field through a fast temporal downsampling strategy and approximates expensive 3-D convolutions with a more efficient combination of spatial and temporal convolutions. Meanwhile, a fine branch progressively downsamples the video in the temporal domain and adopts 3-D convolutional units with reduced channel capacities to capture multiresolution spatio-temporal information. Instead of learning the two branches independently, a shallow spatio-temporal downsampling module is shared between them for efficient low-level feature learning. In addition, lateral connections are learned to effectively fuse information from the two branches at multiple stages. The proposed network strikes a good balance between inference speed and recognition accuracy. Using RGB information only, it achieves competitive performance on five popular video-based action recognition datasets, with an inference speed of 3200 FPS on a single NVIDIA RTX 2080 Ti card.
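The abstract describes the architecture only at a high level. The following PyTorch sketch shows one way the stated pieces could fit together: a shared shallow downsampling stem, a coarse branch built from factorized spatial-plus-temporal convolutions with aggressive temporal striding, a fine branch of reduced-channel 3-D convolutions with progressive temporal downsampling, and lateral connections that fuse the two streams at each stage. All module names, channel widths, strides, and the fusion rule are illustrative assumptions, not the published D3DNet configuration.

# Minimal sketch of the dual-branch idea in the abstract; sizes and fusion
# scheme are assumptions for illustration, not the authors' D3DNet.
import torch
import torch.nn as nn


class Spatial2Plus1D(nn.Module):
    """Factorizes a 3-D convolution into a spatial (1x3x3) and a temporal
    (3x1x1) convolution, as the coarse branch does to cut computation."""
    def __init__(self, in_ch, out_ch, t_stride=1):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, 3, 3), padding=(0, 1, 1), bias=False)
        self.temporal = nn.Conv3d(out_ch, out_ch, (3, 1, 1), stride=(t_stride, 1, 1),
                                  padding=(1, 0, 0), bias=False)
        self.bn = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.temporal(self.spatial(x))))


class D3DNetSketch(nn.Module):
    def __init__(self, num_classes=400):
        super().__init__()
        # Shared shallow spatio-temporal downsampling stem (low-level features).
        self.stem = nn.Sequential(
            nn.Conv3d(3, 32, (3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3), bias=False),
            nn.BatchNorm3d(32), nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Coarse branch: fast temporal downsampling + factorized (2-D + 1-D) convs.
        self.coarse1 = Spatial2Plus1D(32, 64, t_stride=4)   # aggressive temporal stride
        self.coarse2 = Spatial2Plus1D(64, 128, t_stride=2)
        # Fine branch: plain 3-D convs with reduced channel capacity,
        # downsampling the temporal dimension progressively.
        self.fine1 = nn.Sequential(
            nn.Conv3d(32, 16, 3, stride=(2, 1, 1), padding=1, bias=False),
            nn.BatchNorm3d(16), nn.ReLU(inplace=True))
        self.fine2 = nn.Sequential(
            nn.Conv3d(16, 32, 3, stride=(2, 1, 1), padding=1, bias=False),
            nn.BatchNorm3d(32), nn.ReLU(inplace=True))
        # Lateral connections: project fine features, then match the coarse
        # branch's resolution before fusing at each stage.
        self.lateral1 = nn.Conv3d(16, 64, 1, bias=False)
        self.lateral2 = nn.Conv3d(32, 128, 1, bias=False)
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):                                   # x: (N, 3, T, H, W)
        shared = self.stem(x)
        c1, f1 = self.coarse1(shared), self.fine1(shared)
        c1 = c1 + self._match(self.lateral1(f1), c1)        # stage-1 fusion
        c2, f2 = self.coarse2(c1), self.fine2(f1)
        c2 = c2 + self._match(self.lateral2(f2), c2)        # stage-2 fusion
        pooled = c2.mean(dim=(2, 3, 4))                     # global average pooling
        return self.head(pooled)

    @staticmethod
    def _match(src, ref):
        # Resample the lateral feature map to the reference branch's shape.
        return nn.functional.interpolate(src, size=ref.shape[2:], mode="nearest")


if __name__ == "__main__":
    clip = torch.randn(2, 3, 16, 112, 112)                  # batch of 16-frame RGB clips
    print(D3DNetSketch(num_classes=51)(clip).shape)         # torch.Size([2, 51])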
Pages: 4584-4593
Page count: 10
Related Papers
50 records in total
  • [21] 3-D HANet: A Flexible 3-D Heatmap Auxiliary Network for Object Detection
    Xia, Qiming
    Chen, Yidong
    Cai, Guorong
    Chen, Guikun
    Xie, Daoshun
    Su, Jinhe
    Wang, Zongyue
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [22] Complex-Valued 3-D Convolutional Neural Network for PolSAR Image Classification
    Tan, Xiaofeng
    Li, Ming
    Zhang, Peng
    Wu, Yan
    Song, Wanying
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2020, 17 (06) : 1022 - 1026
  • [23] Video-Based Air Quality Measurement With Dual-Channel 3-D Convolutional Network
    Wang, Zhenyu
    Yue, Shaolong
    Song, Chunfeng
    IEEE INTERNET OF THINGS JOURNAL, 2021, 8 (18) : 14372 - 14384
  • [24] TIME-ASYMMETRIC 3D CONVOLUTIONAL NEURAL NETWORKS FOR ACTION RECOGNITION
    Wu, Chengjie
    Han, Jiayue
    Li, Xiaoqiang
    2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2019, : 21 - 25
  • [25] Multi-Task Deep Learning for Real-Time 3D Human Pose Estimation and Action Recognition
    Luvizon, Diogo C.
    Picard, David
    Tabia, Hedi
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2021, 43 (08) : 2752 - 2764
  • [26] Attention-Based Multilevel Co-Occurrence Graph Convolutional LSTM for 3-D Action Recognition
    Xu, Shihao
    Rao, Haocong
    Peng, Hong
    Jiang, Xin
    Guo, Yi
    Hu, Xiping
    Hu, Bin
    IEEE INTERNET OF THINGS JOURNAL, 2021, 8 (21) : 15990 - 16001
  • [27] Real-time 3-D Object Recognition Using Scale Invariant Feature Transform and Stereo Vision
    Hsu, Gee-Sern
    Lin, Chyi-Yeu
    Wu, Jia-Shan
    PROCEEDINGS OF THE FOURTH INTERNATIONAL CONFERENCE ON AUTONOMOUS ROBOTS AND AGENTS, 2009, : 59 - 64
  • [28] Human Action Recognition from RGB-D Frames Based on Real-Time 3D Optical Flow Estimation
    Ballin, Gioia
    Munaro, Matteo
    Menegatti, Emanuele
    BIOLOGICALLY INSPIRED COGNITIVE ARCHITECTURES 2012, 2013, 196 : 65 - 74
  • [29] APFN: Adaptive Perspective-Based Fusion Network for 3-D Place Recognition
    Zhu, Jianxiang
    Yang, Keni
    Zhang, Yangchun
    Peng, Yan
    Peng, Yaxin
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2024, 73
  • [30] Learning Spatio-Temporal Representations With a Dual-Stream 3-D Residual Network for Nondriving Activity Recognition
    Yang, Lichao
    Shan, Xiaocai
    Lv, Chen
    Brighton, James
    Zhao, Yifan
    IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2022, 69 (07) : 7405 - 7414