APSNet: Toward Adaptive Point Sampling for Efficient 3D Action Recognition

Cited by: 14
Authors
Liu, Jiaheng [1 ]
Guo, Jinyang [2 ]
Xu, Dong [3 ,4 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, Beijing 100191, Peoples R China
[2] Beihang Univ, Inst Artificial Intelligence, Beijing 100191, Peoples R China
[3] Univ Sydney, Sch Elect & Informat Engn, Sydney, NSW 2006, Australia
[4] Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
Keywords
Three-dimensional displays; Point cloud compression; Feature extraction; Videos; Geometry; Data mining; Task analysis; 3D action recognition; point cloud; accuracy-efficiency trade-off; NET
DOI
10.1109/TIP.2022.3193290
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Deploying 3D action recognition methods in real-world scenarios remains challenging, so in this work we investigate the accuracy-efficiency trade-off for 3D action recognition. We first introduce a simple and efficient backbone network for 3D action recognition, which extracts geometry and motion representations directly from raw point cloud videos through a set of simple operations (i.e., coordinate offset generation and a mini-PointNet). Based on this backbone, we propose an end-to-end optimized network called the adaptive point sampling network (APSNet) to achieve the accuracy-efficiency trade-off. APSNet consists of three stages: a coarse feature extraction stage, a decision-making stage, and a fine feature extraction stage. In APSNet, we adaptively decide the optimal resolution (i.e., the optimal number of points) for each pair of frames of any input point cloud video under a given computational complexity constraint. Comprehensive experiments on multiple benchmark datasets demonstrate the effectiveness and efficiency of our newly proposed APSNet for 3D action recognition.
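The abstract describes a coordinate-offset-plus-mini-PointNet backbone and a three-stage pipeline (coarse feature extraction, decision making, fine feature extraction). The Python/PyTorch sketch below is only a minimal illustration of that idea under stated assumptions: the module shapes, candidate resolutions, random subsampling, and hard argmax decision are placeholders of mine, not the authors' implementation, which is trained end to end under a computational complexity constraint.

# Minimal, illustrative sketch of the pipeline outlined in the abstract.
# All names, dimensions, and the candidate resolutions are assumptions for
# illustration only, not the published APSNet implementation.
import torch
import torch.nn as nn


class MiniPointNet(nn.Module):
    """Shared point-wise MLP followed by max pooling (PointNet-style)."""
    def __init__(self, in_dim: int = 6, feat_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, feat_dim, 1), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, in_dim, N) -> (B, feat_dim)
        return self.mlp(x).max(dim=-1).values


class APSNetSketch(nn.Module):
    """Coarse feature extraction -> resolution decision -> fine extraction."""
    def __init__(self, resolutions=(256, 512, 1024), feat_dim: int = 128):
        super().__init__()
        self.resolutions = resolutions
        self.coarse = MiniPointNet(in_dim=6, feat_dim=feat_dim)   # few points
        self.fine = MiniPointNet(in_dim=6, feat_dim=feat_dim)     # chosen points
        self.decision = nn.Linear(feat_dim, len(resolutions))     # per-pair choice

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor):
        # frame_t, frame_t1: (B, N, 3) point clouds of two consecutive frames.
        offsets = frame_t1 - frame_t                     # motion cue per point
        pair = torch.cat([frame_t, offsets], dim=-1)     # (B, N, 6): geometry + motion
        pair = pair.transpose(1, 2)                      # (B, 6, N)

        # Stage 1: coarse features from a small random subset of points.
        coarse_idx = torch.randperm(pair.shape[-1])[:128]
        coarse_feat = self.coarse(pair[:, :, coarse_idx])

        # Stage 2: pick the number of points for this frame pair. A hard argmax
        # is used here only for clarity; it is not differentiable, whereas the
        # paper's decision stage is optimized end to end under a complexity budget.
        choice = self.decision(coarse_feat).argmax(dim=-1)    # (B,)
        n_points = self.resolutions[int(choice[0])]            # one choice per batch for simplicity

        # Stage 3: fine features at the selected resolution.
        fine_idx = torch.randperm(pair.shape[-1])[:n_points]
        fine_feat = self.fine(pair[:, :, fine_idx])
        return fine_feat, n_points


if __name__ == "__main__":
    model = APSNetSketch()
    f_t = torch.randn(2, 2048, 3)
    f_t1 = torch.randn(2, 2048, 3)
    feat, n = model(f_t, f_t1)
    print(feat.shape, n)   # torch.Size([2, 128]) and the chosen resolution

The sketch keeps only the structural idea: cheap features on a small subset drive a per-frame-pair resolution choice, and the heavier extractor then runs only on the selected number of points.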
Pages: 5287-5302
Number of pages: 16