Point Spatio-Temporal Transformer Networks for Point Cloud Video Modeling

Cited by: 42
Authors
Fan, Hehe [1 ]
Yang, Yi [2 ]
Kankanhalli, Mohan [1 ]
Affiliations
[1] Natl Univ Singapore, Sch Comp, Singapore 119077, Singapore
[2] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310058, Zhejiang, Peoples R China
Keywords
Point cloud compression; Three-dimensional displays; Transformers; Encoding; Computational modeling; Adaptation models; Solid modeling; Action recognition; point cloud; semantic segmentation; spatio-temporal modeling; video analysis
DOI
10.1109/TPAMI.2022.3161735
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405
Abstract
Due to the inherent unorderedness and irregularity of point clouds, points emerge inconsistently across the frames of a point cloud video. To capture the dynamics in such videos, existing methods usually track points and limit the temporal modeling range in order to preserve spatio-temporal structure. However, because points may flow in and out across frames, computing accurate point trajectories is extremely difficult, especially for long videos. Moreover, when points move fast, they may escape from a region even within a small temporal window, and using the same temporal range for different motions may fail to capture the temporal structure accurately. In this paper, we propose a Point Spatio-Temporal Transformer (PST-Transformer). To preserve spatio-temporal structure, the PST-Transformer adaptively searches for related or similar points across the entire video by performing self-attention on point features. It is also equipped with the ability to encode spatio-temporal structure: because point coordinates are irregular and unordered while point timestamps are regular and ordered, the spatio-temporal encoding is decoupled to reduce the impact of spatial irregularity on temporal modeling. By properly preserving and encoding spatio-temporal structure, the PST-Transformer effectively models point cloud videos and achieves superior performance on 3D action recognition and 4D semantic segmentation.
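The core idea in the abstract can be sketched in a few lines: self-attention runs over all points of the video, with separate (decoupled) encodings for the irregular spatial coordinates and the regular timestamps. The sketch below is a minimal illustration only, assuming additive encodings and random stand-ins for learned projections (`w_xyz`, `wq`, `wk`, `wv` are hypothetical names); it does not reproduce the paper's exact layer design.

```python
import numpy as np

def sinusoidal_time_encoding(t, d):
    """Timestamps are regular and ordered, so a standard sinusoidal encoding applies."""
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    enc = np.zeros((len(t), d))
    enc[:, 0::2] = np.sin(np.outer(t, freqs))
    enc[:, 1::2] = np.cos(np.outer(t, freqs))
    return enc

def decoupled_self_attention(feats, coords, times, seed=0):
    """Self-attention over every point in a point cloud video, with spatial
    and temporal structure encoded separately before attention."""
    rng = np.random.default_rng(seed)
    n, d = feats.shape
    # Spatial xyz coordinates are irregular/unordered -> a learned linear map
    # (here a random stand-in for illustration).
    w_xyz = rng.normal(scale=0.1, size=(3, d))
    spatial_enc = coords @ w_xyz
    temporal_enc = sinusoidal_time_encoding(times, d)
    x = feats + spatial_enc + temporal_enc          # decoupled additive encodings
    wq, wk, wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)                   # attend across the entire video
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v

# Toy video: 2 frames x 4 points each, 8-d features.
feats = np.random.default_rng(1).normal(size=(8, 8))
coords = np.random.default_rng(2).uniform(size=(8, 3))
times = np.repeat([0.0, 1.0], 4)
out = decoupled_self_attention(feats, coords, times)
print(out.shape)  # (8, 8)
```

Because attention weights are computed from features rather than from a fixed spatial or temporal neighborhood, related points can be matched anywhere in the video, which is what removes the need for explicit point tracking.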
Pages: 2181-2192
Page count: 12
References
60 references in total
[1]  
Ba Jimmy Lei, 2016, LAYER NORMALIZATION
[2]   SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition [J].
Caetano, Carlos ;
Sena, Jessica ;
Bremond, Francois ;
dos Santos, Jefersson A. ;
Schwartz, William Robson .
2019 16TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), 2019,
[3]   End-to-End Object Detection with Transformers [J].
Carion, Nicolas ;
Massa, Francisco ;
Synnaeve, Gabriel ;
Usunier, Nicolas ;
Kirillov, Alexander ;
Zagoruyko, Sergey .
COMPUTER VISION - ECCV 2020, PT I, 2020, 12346 :213-229
[4]   Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [J].
Carreira, Joao ;
Zisserman, Andrew .
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, :4724-4733
[5]   4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks [J].
Choy, Christopher ;
Gwak, JunYoung ;
Savarese, Silvio .
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, :3070-3079
[6]  
Dai ZH, 2019, 57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), P2978
[7]  
Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
[8]  
Kläser A., 2008, P BRIT MACH VIS C
[9]  
Dosovitskiy A., 2021, INT C LEARN REPRESEN, DOI 10.48550/arXiv.2010.11929
[10]   Learning Spatiotemporal Features with 3D Convolutional Networks [J].
Tran, Du ;
Bourdev, Lubomir ;
Fergus, Rob ;
Torresani, Lorenzo ;
Paluri, Manohar .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :4489-4497