Temporal Transformer Networks With Self-Supervision for Action Recognition

Cited by: 1
Authors
Zhang, Yongkang [1 ]
Li, Jun [1 ]
Jiang, Na [1 ]
Wu, Guoming [1 ]
Zhang, Han [1 ]
Shi, Zhiping [1 ]
Liu, Zhaoxun [2 ]
Wu, Zizhang [3 ]
Liu, Xianglong [2 ]
Affiliations
[1] Capital Normal Univ, Informat Engn Coll, Beijing 100048, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
[3] ZongMu Technol, Comp Vis Percept Dept, Shanghai 201203, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Temporal modeling; temporal sequence self-supervision (TSS); temporal transformer; video action recognition; video analysis in Internet of Things (IoT); REPRESENTATION;
DOI
10.1109/JIOT.2023.3257992
CLC Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
In recent years, the Internet of Things (IoT) has developed rapidly, and IoT devices are becoming increasingly intelligent. IoT terminal devices, represented by surveillance cameras, play an irreplaceable role in modern society, and most of them integrate video action recognition and other intelligent functions. However, their performance is limited by the constrained computing resources of IoT terminal devices and by the lack of long-range nonlinear temporal relation modeling and reverse motion information modeling. To address this problem, we introduce a temporal transformer network with self-supervision (TTSN). Our high-performance TTSN mainly consists of a temporal transformer module and a temporal sequence self-supervision (TSS) module. Concisely, the efficient temporal transformer module models the nonlinear temporal dependencies among nonlocal frames, which significantly enhances complex motion feature representations. The TSS module adopts a streamlined "random batch random channel" strategy to reverse the sequence of video frames, enabling robust extraction of motion information from the reversed temporal dimension and improving the generalization capability of the model. Extensive experiments on three widely used data sets (HMDB51, UCF101, and Something-Something V1) demonstrate that our proposed TTSN achieves state-of-the-art performance for video action recognition. Its low computational complexity and high performance make TTSN suitable for application in IoT scenarios.
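The TSS pretext task summarized in the abstract, reversing the frame order of randomly chosen clips and training the network to detect the reversal, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function and parameter names are invented, and the paper's full "random batch random channel" strategy additionally selects channels to reverse, which is omitted here.

```python
import numpy as np

def tss_reverse(clips, reverse_prob=0.5, rng=None):
    """Temporal-sequence self-supervision pretext task (batch-level sketch).

    clips: array of shape (B, T, C, H, W) -- a batch of B video clips,
    each with T frames. For each clip, with probability `reverse_prob`
    the frame order is flipped along the temporal axis, and a binary
    label is emitted: 1 = reversed, 0 = forward. A classifier trained
    on these labels must learn direction-sensitive motion features.
    """
    rng = rng or np.random.default_rng()
    clips = np.asarray(clips)
    labels = (rng.random(clips.shape[0]) < reverse_prob).astype(np.int64)
    out = clips.copy()
    # Flip the time axis (axis 1) of the selected clips only.
    out[labels == 1] = out[labels == 1][:, ::-1]
    return out, labels
```

The self-supervised loss would then be a standard binary cross-entropy on these labels, added to the action-classification objective.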
Pages: 12999-13011
Page count: 13
References
76 items
  • [1] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
    Carreira, Joao
    Zisserman, Andrew
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 4724 - 4733
  • [2] End-to-End Image Classification and Compression With Variational Autoencoders
    Chamain, Lahiru D.
    Qi, Siyu
    Ding, Zhi
    [J]. IEEE INTERNET OF THINGS JOURNAL, 2022, 9 (21): : 21916 - 21931
  • [3] Transformer With Bidirectional GRU for Nonintrusive, Sensor-Based Activity Recognition in a Multiresident Environment
    Chen, Dong
    Yongchareon, Sira
    Lai, Edmund M. -K.
    Yu, Jian
    Sheng, Quan Z.
    Li, Yafeng
    [J]. IEEE INTERNET OF THINGS JOURNAL, 2022, 9 (23): : 23716 - 23727
  • [4] Cheng Ouyang, 2020, Computer Vision - ECCV 2020. 16th European Conference. Proceedings. Lecture Notes in Computer Science (LNCS 12374), P762, DOI 10.1007/978-3-030-58526-6_45
  • [5] Spatio-temporal Channel Correlation Networks for Action Classification
    Diba, Ali
    Fayyaz, Mohsen
    Sharma, Vivek
    Arzani, M. Mahdi
    Yousefzadeh, Rahman
    Gall, Juergen
    Van Gool, Luc
    [J]. COMPUTER VISION - ECCV 2018, PT IV, 2018, 11208 : 299 - 315
  • [6] Unsupervised Visual Representation Learning by Context Prediction
    Doersch, Carl
    Gupta, Abhinav
    Efros, Alexei A.
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 1422 - 1430
  • [7] Dosovitskiy A, 2021, Arxiv, DOI arXiv:2010.11929
  • [8] Learning Spatiotemporal Features with 3D Convolutional Networks
    Tran, Du
    Bourdev, Lubomir
    Fergus, Rob
    Torresani, Lorenzo
    Paluri, Manohar
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 4489 - 4497
  • [9] Interaction-Aware Spatio-Temporal Pyramid Attention Networks for Action Classification
    Du, Yang
    Yuan, Chunfeng
    Li, Bing
    Zhao, Lili
    Li, Yangxi
    Hu, Weiming
    [J]. COMPUTER VISION - ECCV 2018, PT XVI, 2018, 11220 : 388 - 404
  • [10] Fan Q., 2019, P ADV NEUR INF PROC, P1