Temporal Transformer Networks With Self-Supervision for Action Recognition

Cited by: 1
Authors
Zhang, Yongkang [1 ]
Li, Jun [1 ]
Jiang, Na [1 ]
Wu, Guoming [1 ]
Zhang, Han [1 ]
Shi, Zhiping [1 ]
Liu, Zhaoxun [2 ]
Wu, Zizhang [3 ]
Liu, Xianglong [2 ]
Affiliations
[1] Capital Normal Univ, Informat Engn Coll, Beijing 100048, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
[3] ZongMu Technol, Comp Vis Percept Dept, Shanghai 201203, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Temporal modeling; temporal sequence self-supervision (TSS); temporal transformer; video action recognition; video analysis in Internet of Things (IoT); REPRESENTATION;
DOI
10.1109/JIOT.2023.3257992
CLC Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
In recent years, the Internet of Things (IoT) has developed rapidly, and IoT devices are becoming increasingly intelligent. IoT terminal devices, represented by surveillance cameras, play an irreplaceable role in modern society, and most of them integrate video action recognition and other intelligent functions. However, their performance is limited by the constrained computing resources of IoT terminal devices and by the lack of long-range nonlinear temporal relation modeling and reverse motion information modeling. To address this problem, we introduce a temporal transformer network with self-supervision (TTSN). Our high-performance TTSN mainly consists of a temporal transformer module and a temporal sequence self-supervision (TSS) module. Concisely, the efficient temporal transformer module models the nonlinear temporal dependencies among nonlocal frames, which significantly enhances complex motion feature representations. The TSS module adopts a streamlined "random batch random channel" strategy to reverse the sequence of video frames, enabling robust extraction of motion information from the reversed temporal dimension and improving the generalization capability of the model. Extensive experiments on three widely used data sets (HMDB51, UCF101, and Something-Something V1) demonstrate that our proposed TTSN achieves state-of-the-art performance for video action recognition. Its low computational complexity and high performance make TTSN suitable for application in IoT scenarios.
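The TSS pretext task summarized in the abstract, reversing the frame order of randomly chosen clips and training the network to detect the reversal, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function and parameter names are invented, and the paper's full "random batch random channel" strategy additionally selects channels to reverse, which is omitted here.

```python
import numpy as np

def tss_reverse(clips, reverse_prob=0.5, rng=None):
    """Temporal-sequence self-supervision pretext task (batch-level sketch).

    clips: array of shape (B, T, C, H, W) -- a batch of B video clips,
    each with T frames. For each clip, with probability `reverse_prob`
    the frame order is flipped along the temporal axis, and a binary
    label is emitted: 1 = reversed, 0 = forward. A classifier trained
    on these labels must learn direction-sensitive motion features.
    """
    rng = rng or np.random.default_rng()
    clips = np.asarray(clips)
    labels = (rng.random(clips.shape[0]) < reverse_prob).astype(np.int64)
    out = clips.copy()
    # Flip the time axis (axis 1) of the selected clips only.
    out[labels == 1] = out[labels == 1][:, ::-1]
    return out, labels
```

The self-supervised loss would then be a standard binary cross-entropy on these labels, added to the action-classification objective.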
Pages: 12999-13011
Page count: 13
References
76 items
  • [1] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
    Carreira, Joao
    Zisserman, Andrew
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 4724 - 4733
  • [2] End-to-End Image Classification and Compression With Variational Autoencoders
    Chamain, Lahiru D.
    Qi, Siyu
    Ding, Zhi
    [J]. IEEE INTERNET OF THINGS JOURNAL, 2022, 9 (21): : 21916 - 21931
  • [3] Transformer With Bidirectional GRU for Nonintrusive, Sensor-Based Activity Recognition in a Multiresident Environment
    Chen, Dong
    Yongchareon, Sira
    Lai, Edmund M. -K.
    Yu, Jian
    Sheng, Quan Z.
    Li, Yafeng
    [J]. IEEE INTERNET OF THINGS JOURNAL, 2022, 9 (23): : 23716 - 23727
  • [4] Cheng Ouyang, 2020, Computer Vision - ECCV 2020. 16th European Conference. Proceedings. Lecture Notes in Computer Science (LNCS 12374), P762, DOI 10.1007/978-3-030-58526-6_45
  • [5] Spatio-temporal Channel Correlation Networks for Action Classification
    Diba, Ali
    Fayyaz, Mohsen
    Sharma, Vivek
    Arzani, M. Mahdi
    Yousefzadeh, Rahman
    Gall, Juergen
    Van Gool, Luc
    [J]. COMPUTER VISION - ECCV 2018, PT IV, 2018, 11208 : 299 - 315
  • [6] Unsupervised Visual Representation Learning by Context Prediction
    Doersch, Carl
    Gupta, Abhinav
    Efros, Alexei A.
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 1422 - 1430
  • [7] Dosovitskiy A, 2021, Arxiv, DOI arXiv:2010.11929
  • [8] Learning Spatiotemporal Features with 3D Convolutional Networks
    Tran, Du
    Bourdev, Lubomir
    Fergus, Rob
    Torresani, Lorenzo
    Paluri, Manohar
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 4489 - 4497
  • [9] Interaction-Aware Spatio-Temporal Pyramid Attention Networks for Action Classification
    Du, Yang
    Yuan, Chunfeng
    Li, Bing
    Zhao, Lili
    Li, Yangxi
    Hu, Weiming
    [J]. COMPUTER VISION - ECCV 2018, PT XVI, 2018, 11220 : 388 - 404
  • [10] Fan Q., 2019, P ADV NEUR INF PROC, P1