Temporal Transformer Networks With Self-Supervision for Action Recognition

Cited by: 1
Authors
Zhang, Yongkang [1 ]
Li, Jun [1 ]
Jiang, Na [1 ]
Wu, Guoming [1 ]
Zhang, Han [1 ]
Shi, Zhiping [1 ]
Liu, Zhaoxun [2 ]
Wu, Zizhang [3 ]
Liu, Xianglong [2 ]
Affiliations
[1] Capital Normal Univ, Informat Engn Coll, Beijing 100048, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
[3] ZongMu Technol, Comp Vis Percept Dept, Shanghai 201203, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Temporal modeling; temporal sequence self-supervision (TSS); temporal transformer; video action recognition; video analysis in Internet of Things (IoT); REPRESENTATION;
DOI
10.1109/JIOT.2023.3257992
CLC number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
In recent years, the Internet of Things (IoT) has developed rapidly, and IoT devices are becoming increasingly intelligent. IoT terminal devices, represented by surveillance cameras, play an irreplaceable role in modern society, and most of them integrate video action recognition and other intelligent functions. However, their performance is limited by the constrained computing resources of IoT terminal devices and by the lack of long-range nonlinear temporal relation modeling and reverse motion information modeling. To address this problem, we introduce a temporal transformer network with self-supervision (TTSN). Our high-performance TTSN mainly consists of a temporal transformer module and a temporal sequence self-supervision (TSS) module. Concisely speaking, the efficient temporal transformer module models the nonlinear temporal dependencies among nonlocal frames, which significantly enhances complex motion feature representations. The TSS module adopts a streamlined "random batch random channel" strategy to reverse the sequence of video frames, enabling robust extraction of motion information representations from the reversed temporal dimension and improving the generalization capability of the model. Extensive experiments on three widely used data sets (HMDB51, UCF101, and Something-Something V1) demonstrate that the proposed TTSN achieves state-of-the-art performance for video action recognition. Its favorable computational complexity and high performance make TTSN well suited to IoT scenarios.
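As a rough illustration of the "random batch random channel" reversal described in the abstract, the following minimal Python sketch (assuming PyTorch-style clip tensors of shape batch x channels x frames x height x width) reverses the frame order for a randomly chosen subset of batch samples and channels; the function name reverse_random_batch_channel, the sampling ratios, and the binary reversal label are illustrative assumptions, not the authors' implementation.

    import torch

    def reverse_random_batch_channel(clips, batch_ratio=0.5, channel_ratio=0.5):
        """Illustrative sketch (not the paper's code): temporally reverse a random
        subset of batch samples, and within those samples a random subset of channels,
        mimicking the "random batch random channel" idea.

        clips: tensor of shape (B, C, T, H, W).
        Returns the augmented clips and a per-sample label (1 = reversed) that a
        self-supervision head could be trained to predict.
        """
        clips = clips.clone()
        b, c, t, h, w = clips.shape
        labels = torch.zeros(b, dtype=torch.long)

        # Randomly pick which samples in the batch get temporally reversed.
        batch_idx = torch.randperm(b)[: max(1, int(b * batch_ratio))]
        labels[batch_idx] = 1

        # Randomly pick which channels are reversed for those samples.
        channel_idx = torch.randperm(c)[: max(1, int(c * channel_ratio))]

        for i in batch_idx:
            # Flip along the temporal axis (dim 1 of the (C_subset, T, H, W) slice).
            clips[i, channel_idx] = torch.flip(clips[i, channel_idx], dims=[1])
        return clips, labels

    # Example: a batch of 4 RGB clips with 8 frames of size 112x112.
    clips = torch.randn(4, 3, 8, 112, 112)
    augmented, reversed_labels = reverse_random_batch_channel(clips)

In a TSS-style setup, a lightweight classification head could then be trained to predict reversed_labels, providing a self-supervised signal alongside the action recognition loss.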
Pages: 12999-13011
Page count: 13
    She, Qi
    Smolic, Aljosa
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 13209 - 13218