Point Spatio-Temporal Transformer Networks for Point Cloud Video Modeling

Cited by: 42
Authors
Fan, Hehe [1 ]
Yang, Yi [2 ]
Kankanhalli, Mohan [1 ]
Affiliations
[1] Natl Univ Singapore, Sch Comp, Singapore 119077, Singapore
[2] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310058, Zhejiang, Peoples R China
Keywords
Point cloud compression; Three-dimensional displays; Transformers; Encoding; Computational modeling; Adaptation models; Solid modeling; Action recognition; point cloud; semantic segmentation; spatio-temporal modeling; video analysis
DOI
10.1109/TPAMI.2022.3161735
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Due to the inherent lack of order and the irregularity of point clouds, points emerge inconsistently across different frames of a point cloud video. To capture the dynamics in such videos, existing methods usually track points and limit the temporal modeling range in order to preserve spatio-temporal structure. However, because points may flow in and out across frames, computing accurate point trajectories is extremely difficult, especially for long videos. Moreover, when points move fast, they may escape from a region even within a small temporal window. In addition, using the same temporal range for different motions may fail to capture the temporal structure accurately. In this paper, we propose a Point Spatio-Temporal Transformer (PST-Transformer). To preserve the spatio-temporal structure, the PST-Transformer adaptively searches for related or similar points across the entire video by performing self-attention on point features. Moreover, the PST-Transformer is equipped with the ability to encode spatio-temporal structure. Because point coordinates are irregular and unordered while point timestamps are regular and ordered, the spatio-temporal encoding is decoupled to reduce the impact of spatial irregularity on temporal modeling. By properly preserving and encoding spatio-temporal structure, the PST-Transformer effectively models point cloud videos and achieves superior performance on 3D action recognition and 4D semantic segmentation.
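The decoupled spatio-temporal encoding described in the abstract can be illustrated with a short sketch. The following minimal PyTorch example is not the authors' implementation: the module name DecoupledSTAttention, the dimensions, and the way the two encodings are added to the point features are assumptions for illustration; it only shows the general pattern of encoding irregular coordinates and ordered timestamps separately before video-wide self-attention.

import torch
import torch.nn as nn

class DecoupledSTAttention(nn.Module):
    # Illustrative sketch, not the paper's code. Spatial coordinates
    # (irregular, unordered) and frame indices (regular, ordered) are
    # encoded by separate branches, so spatial irregularity does not
    # contaminate the temporal encoding.
    def __init__(self, dim, num_heads=4, max_frames=1024):
        super().__init__()
        self.spatial_enc = nn.Linear(3, dim)               # continuous xyz -> feature
        self.temporal_enc = nn.Embedding(max_frames, dim)  # discrete frame index -> feature
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats, xyz, frame_ids):
        # feats: (B, N, C) features of points gathered from the whole clip
        # xyz: (B, N, 3) point coordinates; frame_ids: (B, N) integer timestamps
        q = feats + self.spatial_enc(xyz) + self.temporal_enc(frame_ids)
        # self-attention searches related/similar points across the entire video
        out, _ = self.attn(q, q, q)
        return out

# Toy usage: 2 clips, 256 points drawn from 16 frames, 64-d features.
layer = DecoupledSTAttention(dim=64)
feats = torch.randn(2, 256, 64)
xyz = torch.randn(2, 256, 3)
frame_ids = torch.randint(0, 16, (2, 256))
print(layer(feats, xyz, frame_ids).shape)  # torch.Size([2, 256, 64])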
Pages: 2181-2192
Page count: 12