D3D: Dual 3-D Convolutional Network for Real-Time Action Recognition

Cited by: 29
Authors
Jiang, Shengqin [1 ,2 ]
Qi, Yuankai [3 ]
Zhang, Haokui [4 ]
Bai, Zongwen [5 ,6 ]
Lu, Xiaobo [1 ,2 ]
Wang, Peng [7 ]
Affiliations
[1] Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China
[2] Minist Educ, Key Lab Measurement & Control Complex Syst Engn, Nanjing 210096, Peoples R China
[3] Harbin Inst Technol, Sch Comp Sci & Technol, Weihai 264209, Peoples R China
[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710129, Peoples R China
[5] Shaanxi Key Lab Intelligent Proc Big Energy Data, Yanan 716000, Peoples R China
[6] Yanan Univ, Sch Phys & Elect Informat, Yanan 716000, Peoples R China
[7] Univ Wollongong, Sch Comp & Informat Technol, Wollongong, NSW 2170, Australia
Funding
National Natural Science Foundation of China;
Keywords
Three-dimensional displays; Feature extraction; Convolution; Two dimensional displays; Streaming media; Kernel; Informatics; Three-dimensional convolutional neural networks (3D CNNs); action recognition; lightweight network; spatio-temporal information;
DOI
10.1109/TII.2020.3018487
CLC number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
Three-dimensional convolutional neural networks (3D CNNs) have been explored to learn spatio-temporal information for video-based human action recognition. However, the heavy computational cost and memory demand of standard 3D CNNs hinder their application in practical scenarios. In this article, we address these limitations by proposing a novel dual 3-D convolutional network (D3DNet) with two complementary lightweight branches. A coarse branch maintains a large temporal receptive field through a fast temporal downsampling strategy and approximates expensive 3-D convolutions with a combination of more efficient spatial and temporal convolutions. Meanwhile, a fine branch progressively downsamples the video in the temporal domain and adopts 3-D convolutional units with reduced channel capacities to capture multiresolution spatio-temporal information. Instead of learning the two branches independently, a shallow spatio-temporal downsampling module is shared between them for efficient low-level feature learning. In addition, lateral connections are learned to effectively fuse information from the two branches at multiple stages. The proposed network strikes a good balance between inference speed and action recognition performance. Using RGB information only, it achieves competitive performance on five popular video-based action recognition datasets, with an inference speed of 3200 FPS on a single NVIDIA RTX 2080 Ti card.
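The dual-branch idea described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' D3DNet implementation: the module names, channel counts, kernel sizes, downsampling rates, and the single fusion point are assumptions made for exposition only. It shows a shared shallow stem, a coarse branch with aggressive temporal downsampling and (2+1)D factorized convolutions, a fine branch with gentler temporal downsampling and reduced channel width, and a lateral connection that fuses the two streams.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only: channel counts, kernel sizes, and downsampling
# rates are assumptions, not the paper's exact D3DNet configuration.

class SharedStem(nn.Module):
    """Shallow spatio-temporal downsampling shared by both branches."""
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=(3, 7, 7),
                              stride=(1, 2, 2), padding=(1, 3, 3), bias=False)
        self.bn = nn.BatchNorm3d(out_ch)

    def forward(self, x):                      # x: (N, C, T, H, W)
        return F.relu(self.bn(self.conv(x)))


class CoarseUnit(nn.Module):
    """(2+1)D factorization: a spatial conv followed by a temporal conv,
    approximating a full (and more expensive) 3-D convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, 3, 3), padding=(0, 1, 1), bias=False)
        self.temporal = nn.Conv3d(out_ch, out_ch, (3, 1, 1), padding=(1, 0, 0), bias=False)
        self.bn = nn.BatchNorm3d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.temporal(self.spatial(x))))


class FineUnit(nn.Module):
    """Full 3-D convolution kept cheap via a reduced channel capacity."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm3d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))


class DualBranchSketch(nn.Module):
    def __init__(self, num_classes=400):
        super().__init__()
        self.stem = SharedStem()
        # Coarse branch: fast temporal downsampling up front.
        self.coarse_pool = nn.MaxPool3d((4, 1, 1))
        self.coarse = CoarseUnit(32, 64)
        # Fine branch: progressive temporal downsampling, fewer channels.
        self.fine_pool = nn.MaxPool3d((2, 1, 1))
        self.fine = FineUnit(32, 16)
        # Lateral connection: project fine features to the coarse width.
        self.lateral = nn.Conv3d(16, 64, 1, bias=False)
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.stem(x)
        c = self.coarse(self.coarse_pool(x))
        f = self.fine(self.fine_pool(x))
        # Match temporal/spatial resolution, then fuse the two streams.
        f = F.adaptive_avg_pool3d(self.lateral(f), c.shape[2:])
        fused = c + f
        feat = fused.mean(dim=(2, 3, 4))        # global spatio-temporal pooling
        return self.head(feat)


if __name__ == "__main__":
    clip = torch.randn(1, 3, 16, 112, 112)      # one 16-frame RGB clip
    print(DualBranchSketch()(clip).shape)       # torch.Size([1, 400])
```

In the real D3DNet, fusion via lateral connections happens at multiple stages rather than once, and the reported 3200 FPS inference speed depends on the full lightweight design; the sketch above only conveys the structure of the two complementary branches.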
Pages: 4584-4593
Page count: 10
Related papers
50 records in total
  • [31] 3D CONVOLUTIONAL NEURAL NETWORK WITH MULTI-MODEL FRAMEWORK FOR ACTION RECOGNITION
    Jing, Longlong
    Ye, Yuancheng
    Yang, Xiaodong
    Tian, Yingli
    [J]. 2017 24TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2017, : 1837 - 1841
  • [32] 3D-based Deep Convolutional Neural Network for action recognition with depth sequences
    Liu, Zhi
    Zhang, Chenyang
    Tian, Yingli
    [J]. IMAGE AND VISION COMPUTING, 2016, 55 : 93 - 100
  • [33] A Real-Time Sparsity-Aware 3D-CNN Processor for Mobile Hand Gesture Recognition
    Kim, Seungbin
    Jung, Jueun
    Lee, Kyuho Jason
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS, 2024, 71 (08) : 3695 - 3707
  • [34] A Reinforcement Learning Approach for Real-Time Articulated Surgical Instrument 3-D Pose Reconstruction
    Fan, Ke
    Chen, Ziyang
    Liu, Qiaoling
    Ferrigno, Giancarlo
    De Momi, Elena
    [J]. IEEE TRANSACTIONS ON MEDICAL ROBOTICS AND BIONICS, 2024, 6 (04): : 1458 - 1467
  • [35] 3D Convolutional Neural Networks for Human Action Recognition
    Ji, Shuiwang
    Xu, Wei
    Yang, Ming
    Yu, Kai
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2013, 35 (01) : 221 - 231
  • [36] Asymmetric 3D Convolutional Neural Networks for action recognition
    Yang, Hao
    Yuan, Chunfeng
    Li, Bing
    Du, Yang
    Xing, Junliang
    Hu, Weiming
    Maybank, Stephen J.
    [J]. PATTERN RECOGNITION, 2019, 85 : 1 - 12
  • [37] Joint 3-D Human Reconstruction and Hybrid Pose Self-Supervision for Action Recognition
    Quan, Wei
    Wang, Hexin
    Li, Luwei
    Qiu, Yuxuan
    Shi, Zhiping
    Jiang, Na
    [J]. IEEE INTERNET OF THINGS JOURNAL, 2025, 12 (07): : 8470 - 8483
  • [38] AR3D: Attention Residual 3D Network for Human Action Recognition
    Dong, Min
    Fang, Zhenglin
    Li, Yongfa
    Bi, Sheng
    Chen, Jiangcheng
    [J]. SENSORS, 2021, 21 (05) : 1 - 15
  • [39] A Deep Learning Approach for Real-Time 3D Human Action Recognition from Skeletal Data
    Huy Hieu Pham
    Salmane, Houssam
    Khoudour, Louandi
    Crouzil, Alain
    Zegers, Pablo
    Velastin, Sergio A.
    [J]. IMAGE ANALYSIS AND RECOGNITION, ICIAR 2019, PT I, 2019, 11662 : 18 - 32
  • [40] SparseVoxNet: 3-D Object Recognition With Sparsely Aggregation of 3-D Dense Blocks
    Karambakhsh, Ahmad
    Sheng, Bin
    Li, Ping
    Li, Huating
    Kim, Jinman
    Jung, Younhyun
    Chen, C. L. Philip
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (01) : 532 - 546