D3D: Dual 3-D Convolutional Network for Real-Time Action Recognition

Cited by: 29

Authors:

Jiang, Shengqin [1,2]

Qi, Yuankai [3]

Zhang, Haokui [4]

Bai, Zongwen [5,6]

Lu, Xiaobo [1,2]

Wang, Peng [7]

Affiliations:

[1] Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China

[2] Minist Educ, Key Lab Measurement & Control Complex Syst Engn, Nanjing 210096, Peoples R China

[3] Harbin Inst Technol, Sch Comp Sci & Technol, Weihai 264209, Peoples R China

[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710129, Peoples R China

[5] Shaanxi Key Lab Intelligent Proc Big Energy Data, Yanan 716000, Peoples R China

[6] Yanan Univ, Sch Phys & Elect Informat, Yanan 716000, Peoples R China

[7] Univ Wollongong, Sch Comp & Informat Technol, Wollongong, NSW 2170, Australia

Source:

IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS | 2021 / Vol. 17 / No. 07

Funding:

National Natural Science Foundation of China;

Keywords:

Three-dimensional displays; Feature extraction; Convolution; Two dimensional displays; Streaming media; Kernel; Informatics; Three-dimensional convolutional neural networks (3D CNNs); action recognition; lightweight network; spatio-temporal information;

DOI:

10.1109/TII.2020.3018487

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

Three-dimensional convolutional neural networks (3D CNNs) have been explored to learn spatio-temporal information for video-based human action recognition. However, the high computational cost and memory demand of standard 3D CNNs hinder their application in practical scenarios. In this article, we address these limitations by proposing a novel dual 3-D convolutional network (D3DNet) with two complementary lightweight branches. A coarse branch maintains a large temporal receptive field through a fast temporal downsampling strategy and simulates the expensive 3-D convolutions using a combination of more efficient spatial and temporal convolutions. Meanwhile, a fine branch progressively downsamples the video in the temporal domain and adopts 3-D convolutional units with reduced channel capacities to capture multiresolution spatio-temporal information. Rather than learning the two branches independently, both share a shallow spatio-temporal downsampling module for efficient low-level feature learning. In addition, lateral connections are learned to fuse the information from the two branches at multiple stages. The proposed network strikes a good balance between inference speed and action recognition performance. Using RGB information only, it achieves competitive performance on five popular video-based action recognition datasets, with an inference speed of 3200 FPS on a single NVIDIA GTX 2080Ti card.
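
The abstract's claim that a spatial convolution followed by a temporal convolution is cheaper than a full 3-D convolution can be checked with simple weight counting. The sketch below is illustrative only: the channel width of 64, the kernel size of 3, and the choice of an intermediate width equal to the output width are assumptions made for the arithmetic, not values taken from the paper, and the function names are hypothetical.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a dense 3-D convolution with a kt x kh x kw kernel (bias ignored)."""
    return c_in * c_out * kt * kh * kw


def factorized_params(c_in, c_mid, c_out, k):
    """Weight count of a 1 x k x k spatial conv followed by a k x 1 x 1 temporal conv."""
    spatial = conv3d_params(c_in, c_mid, 1, k, k)
    temporal = conv3d_params(c_mid, c_out, k, 1, 1)
    return spatial + temporal


# Assumed, illustrative sizes (not from the paper).
C_IN, C_MID, C_OUT, K = 64, 64, 64, 3

full = conv3d_params(C_IN, C_OUT, K, K, K)
fact = factorized_params(C_IN, C_MID, C_OUT, K)
ratio = full / fact

print(f"full 3-D conv weights:   {full}")
print(f"factorized conv weights: {fact}")
print(f"reduction factor:        {ratio:.2f}x")
```

With equal channel widths the saving reduces to k^3 / (k^2 + k), so it grows with kernel size; the actual saving in any given network depends on its real channel configuration.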


Pages: 4584-4593

Page count: 10
