Temporal Transformer Networks With Self-Supervision for Action Recognition

Cited by: 1
Authors
Zhang, Yongkang [1 ]
Li, Jun [1 ]
Jiang, Na [1 ]
Wu, Guoming [1 ]
Zhang, Han [1 ]
Shi, Zhiping [1 ]
Liu, Zhaoxun [2 ]
Wu, Zizhang [3 ]
Liu, Xianglong [2 ]
Affiliations
[1] Capital Normal Univ, Informat Engn Coll, Beijing 100048, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
[3] ZongMu Technol, Comp Vis Percept Dept, Shanghai 201203, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Temporal modeling; temporal sequence self-supervision (TSS); temporal transformer; video action recognition; video analysis in Internet of Things (IoT); REPRESENTATION;
DOI
10.1109/JIOT.2023.3257992
CLC number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
In recent years, the Internet of Things (IoT) has developed rapidly, and IoT devices are becoming increasingly intelligent. IoT terminal devices, represented by surveillance cameras, play an irreplaceable role in modern society, and most of them integrate video action recognition and other intelligent functions. However, their performance is limited by the constrained computing resources of IoT terminal devices and by the lack of long-range nonlinear temporal relation modeling and reverse motion information modeling. To address this problem, we introduce a temporal transformer network with self-supervision (TTSN). Our high-performance TTSN mainly consists of a temporal transformer module and a temporal sequence self-supervision (TSS) module. Concisely speaking, the efficient temporal transformer module models the nonlinear temporal dependencies among nonlocal frames, which significantly enhances complex motion feature representations. The TSS module adopts a streamlined "random batch random channel" strategy to reverse the sequence of video frames, enabling robust extraction of motion information representations from the reversed temporal dimension and improving the generalization capability of the model. Extensive experiments on three widely used data sets (HMDB51, UCF101, and Something-Something V1) demonstrate that the proposed TTSN achieves state-of-the-art performance for video action recognition. Its favorable computational complexity and high performance make TTSN well suited to IoT scenarios.
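As a rough illustration of the "random batch random channel" reversal described in the abstract, the following minimal Python sketch (assuming PyTorch-style clip tensors of shape batch x channels x frames x height x width) reverses the frame order for a randomly chosen subset of batch samples and channels; the function name reverse_random_batch_channel, the sampling ratios, and the binary reversal label are illustrative assumptions, not the authors' implementation.

    import torch

    def reverse_random_batch_channel(clips, batch_ratio=0.5, channel_ratio=0.5):
        """Illustrative sketch (not the paper's code): temporally reverse a random
        subset of batch samples, and within those samples a random subset of channels,
        mimicking the "random batch random channel" idea.

        clips: tensor of shape (B, C, T, H, W).
        Returns the augmented clips and a per-sample label (1 = reversed) that a
        self-supervision head could be trained to predict.
        """
        clips = clips.clone()
        b, c, t, h, w = clips.shape
        labels = torch.zeros(b, dtype=torch.long)

        # Randomly pick which samples in the batch get temporally reversed.
        batch_idx = torch.randperm(b)[: max(1, int(b * batch_ratio))]
        labels[batch_idx] = 1

        # Randomly pick which channels are reversed for those samples.
        channel_idx = torch.randperm(c)[: max(1, int(c * channel_ratio))]

        for i in batch_idx:
            # Flip along the temporal axis (dim 1 of the (C_subset, T, H, W) slice).
            clips[i, channel_idx] = torch.flip(clips[i, channel_idx], dims=[1])
        return clips, labels

    # Example: a batch of 4 RGB clips with 8 frames of size 112x112.
    clips = torch.randn(4, 3, 8, 112, 112)
    augmented, reversed_labels = reverse_random_batch_channel(clips)

In a TSS-style setup, a lightweight classification head could then be trained to predict reversed_labels, providing a self-supervised signal alongside the action recognition loss.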
Pages: 12999-13011
Page count: 13
    She, Qi
    Smolic, Aljosa
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 13209 - 13218