Two-Stream Collaborative Learning With Spatial-Temporal Attention for Video Classification

Cited by: 87
Authors
Peng, Yuxin [1]
Zhao, Yunzhen [1]
Zhang, Junchao [1]
Affiliations
[1] Peking Univ, Inst Comp Sci & Technol, Beijing 100871, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Video classification; static-motion collaborative learning; spatial-temporal attention; adaptively weighted learning; action recognition; representation; histograms; flow
DOI
10.1109/TCSVT.2018.2808685
CLC classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology]
Subject classification codes
0808; 0809
Abstract
Video classification is highly important and has widespread applications, such as video search and intelligent surveillance. Video naturally contains both static and motion information, which can be represented by frames and optical flow, respectively. Recently, researchers have generally adopted deep networks to capture the static and motion information separately, which has two main limitations. First, the coexistence relationship between spatial and temporal attention is ignored, although they should be jointly modeled as the spatial and temporal evolutions of video to learn discriminative video features. Second, the strong complementarity between static and motion information is ignored, although the two should be collaboratively learned to enhance each other. To address these two limitations, this paper proposes the two-stream collaborative learning with spatial-temporal attention (TCLSTA) approach, which consists of two models. First, in the spatial-temporal attention model, spatial-level attention emphasizes the salient regions in a frame, and temporal-level attention exploits the discriminative frames in a video; the two levels are mutually enhanced to jointly learn discriminative static and motion features for better classification performance. Second, the static-motion collaborative model not only achieves mutual guidance between static and motion information to enhance feature learning, but also adaptively learns the fusion weights of the static and motion streams, thus exploiting their strong complementarity to improve video classification. Experiments on four widely used data sets show that the TCLSTA approach achieves the best performance compared with more than 10 state-of-the-art methods.
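To make the abstract's two components concrete, below is a minimal sketch (PyTorch; written for this record, not taken from the paper) of the ideas it describes: spatial-level attention over regions in each frame, temporal-level attention over frames, and learnable fusion weights for the static (frame/RGB) and motion (optical-flow) streams. All module names, feature dimensions, and the choice of softmax-normalized fusion weights are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialTemporalAttention(nn.Module):
        """Spatial attention over regions, then temporal attention over frames."""
        def __init__(self, feat_dim):
            super().__init__()
            self.spatial_fc = nn.Linear(feat_dim, 1)   # scores each region of a frame
            self.temporal_fc = nn.Linear(feat_dim, 1)  # scores each frame of a video

        def forward(self, x):
            # x: (batch, frames, regions, feat_dim) region-level features
            s = F.softmax(self.spatial_fc(x), dim=2)            # spatial weights
            frame_feat = (s * x).sum(dim=2)                     # (batch, frames, feat_dim)
            t = F.softmax(self.temporal_fc(frame_feat), dim=1)  # temporal weights
            return (t * frame_feat).sum(dim=1)                  # (batch, feat_dim)

    class TwoStreamAdaptiveFusion(nn.Module):
        """Attend within each stream, classify, then fuse with learned weights."""
        def __init__(self, feat_dim, num_classes):
            super().__init__()
            self.static_att = SpatialTemporalAttention(feat_dim)
            self.motion_att = SpatialTemporalAttention(feat_dim)
            self.static_cls = nn.Linear(feat_dim, num_classes)
            self.motion_cls = nn.Linear(feat_dim, num_classes)
            # learnable, softmax-normalized fusion weights (an assumption here)
            self.fusion_logits = nn.Parameter(torch.zeros(2))

        def forward(self, rgb_feats, flow_feats):
            static_scores = self.static_cls(self.static_att(rgb_feats))
            motion_scores = self.motion_cls(self.motion_att(flow_feats))
            w = F.softmax(self.fusion_logits, dim=0)
            return w[0] * static_scores + w[1] * motion_scores

    # Example: 2 videos, 8 frames, 49 regions (7x7 feature map), 512-dim features
    model = TwoStreamAdaptiveFusion(feat_dim=512, num_classes=101)
    logits = model(torch.randn(2, 8, 49, 512), torch.randn(2, 8, 49, 512))  # (2, 101)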
Pages: 773-786
Number of pages: 14
Related Papers
50 records in total
  • [1] Video Saliency Prediction Based on Spatial-Temporal Two-Stream Network
    Zhang, Kao
    Chen, Zhenzhong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2019, 29 (12) : 3544 - 3557
  • [2] Spatial-Temporal Attention Two-Stream Convolution Neural Network for Smoke Region Detection
    Ding, Zhipeng
    Zhao, Yaqin
    Li, Ao
    Zheng, Zhaoxiang
    FIRE-SWITZERLAND, 2021, 4 (04)
  • [3] Spatial-temporal interaction learning based two-stream network for action recognition
    Liu, Tianyu
    Ma, Yujun
    Yang, Wenhan
    Ji, Wanting
    Wang, Ruili
    Jiang, Ping
    INFORMATION SCIENCES, 2022, 606 : 864 - 876
  • [4] Two-Stream Video Classification with Cross-Modality Attention
    Chi, Lu
    Tian, Guiyu
    Mu, Yadong
    Tian, Qi
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 4511 - 4520
  • [5] Two-Stream Spatial-Temporal Auto-Encoder With Adversarial Training for Video Anomaly Detection
    Guo, Biao
    Liu, Mingrui
    He, Qian
    Jiang, Ming
    IEEE ACCESS, 2024, 12 : 125881 - 125889
  • [6] Spatial-Temporal Analysis-Based Video Quality Assessment: A Two-Stream Convolutional Network Approach
    He, Jianghui
    Wang, Zhe
    Liu, Yi
    Song, Yang
    ELECTRONICS, 2024, 13 (10)
  • [7] Two-stream deep spatial-temporal auto-encoder for surveillance video abnormal event detection
    Li, Tong
    Chen, Xinyue
    Zhu, Fushun
    Zhang, Zhengyu
    Yan, Hua
    NEUROCOMPUTING, 2021, 439 : 256 - 270
  • [8] Two-Stream Spatial-Temporal Feature Extraction and Classification Model for Anomaly Event Detection Using Hybrid Deep Learning Architectures
    Mangai, P.
    Geetha, M. Kalaiselvi
    Kumaravelan, G.
    INTERNATIONAL JOURNAL OF IMAGE AND GRAPHICS, 2024, 24 (06)
  • [9] STA-GCN: two-stream graph convolutional network with spatial-temporal attention for hand gesture recognition
    Zhang, Wei
    Lin, Zeyi
    Cheng, Jian
    Ma, Cuixia
    Deng, Xiaoming
    Wang, Hongan
    VISUAL COMPUTER, 2020, 36 (10-12) : 2433 - 2444
  • [10] A two-stream network with joint spatial-temporal distance for video-based person re-identification
    Han, Zhisong
    Liang, Yaling
    Chen, Zengqun
    Zhou, Zhiheng
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2020, 39 (03) : 3769 - 3781