Two-Stream Collaborative Learning With Spatial-Temporal Attention for Video Classification

Cited: 87
Authors
Peng, Yuxin [1 ]
Zhao, Yunzhen [1 ]
Zhang, Junchao [1 ]
Affiliations
[1] Peking Univ, Inst Comp Sci & Technol, Beijing 100871, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Video classification; static-motion collaborative learning; spatial-temporal attention; adaptively weighted learning; ACTION RECOGNITION; REPRESENTATION; HISTOGRAMS; FLOW;
DOI
10.1109/TCSVT.2018.2808685
CLC Classification
TM [Electrical Engineering]; TN [Electronic and Communication Technology];
Subject Classification Codes
0808; 0809;
Abstract
Video classification is highly important and has widespread applications, such as video search and intelligent surveillance. Video naturally contains both static and motion information, which can be represented by frames and optical flow, respectively. Recently, researchers have generally adopted deep networks to capture the static and motion information separately, which has two main limitations. First, the coexistence relationship between spatial and temporal attention is ignored, although they should be jointly modeled as the spatial and temporal evolutions of video to learn discriminative video features. Second, the strong complementarity between static and motion information is ignored, although the two should be collaboratively learned to enhance each other. To address these two limitations, this paper proposes the two-stream collaborative learning with spatial-temporal attention (TCLSTA) approach, which consists of two models. First, in the spatial-temporal attention model, spatial-level attention emphasizes the salient regions in a frame, and temporal-level attention exploits the discriminative frames in a video. The two are mutually enhanced to jointly learn discriminative static and motion features for better classification performance. Second, the static-motion collaborative model not only achieves mutual guidance between static and motion information to enhance feature learning but also adaptively learns the fusion weights of the static and motion streams, thus exploiting the strong complementarity between static and motion information to improve video classification. Experiments on four widely used data sets show that our TCLSTA approach achieves the best performance compared with more than 10 state-of-the-art methods.
Pages: 773-786
Page count: 14
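
The abstract outlines two mechanisms that are easy to make concrete: spatial-level attention over the regions of each frame combined with temporal-level attention over the frames of a video, and an adaptively learned weighting of the static (frame) and motion (optical-flow) streams. Below is a minimal PyTorch sketch of those two ideas. It is an illustration under assumed conventions, not the authors' TCLSTA implementation: the module names, the (T, C, H, W) feature shapes, the 1x1-convolution and linear scoring heads, and the softmax-normalized fusion parameter are all assumptions, and the paper's static-motion mutual guidance between streams is not modeled here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialTemporalAttention(nn.Module):
    """Illustrative sketch: spatial attention weights the locations within each
    frame's feature map; temporal attention then weights the frames."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial_scorer = nn.Conv2d(channels, 1, kernel_size=1)  # one score per location
        self.temporal_scorer = nn.Linear(channels, 1)                # one score per frame

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) convolutional features for T frames of one video.
        T, C, H, W = feats.shape
        # Spatial-level attention: softmax over the H*W locations of each frame,
        # emphasizing salient regions.
        s = self.spatial_scorer(feats).view(T, H * W)
        s = F.softmax(s, dim=1).view(T, 1, H, W)
        frame_vecs = (feats * s).sum(dim=(2, 3))                 # (T, C)
        # Temporal-level attention: softmax over the T frames, emphasizing
        # discriminative frames.
        t = F.softmax(self.temporal_scorer(frame_vecs), dim=0)   # (T, 1)
        return (frame_vecs * t).sum(dim=0)                       # (C,)


class TwoStreamFusion(nn.Module):
    """Illustrative sketch: adaptively weighted late fusion of the static
    (frame) and motion (optical-flow) stream predictions."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.static_att = SpatialTemporalAttention(channels)
        self.motion_att = SpatialTemporalAttention(channels)
        self.static_cls = nn.Linear(channels, num_classes)
        self.motion_cls = nn.Linear(channels, num_classes)
        # Learnable per-stream fusion weights, normalized at fusion time so the
        # static/motion balance is learned rather than fixed by hand.
        self.fusion_param = nn.Parameter(torch.zeros(2))

    def forward(self, static_feats: torch.Tensor, motion_feats: torch.Tensor) -> torch.Tensor:
        static_scores = self.static_cls(self.static_att(static_feats))
        motion_scores = self.motion_cls(self.motion_att(motion_feats))
        w = F.softmax(self.fusion_param, dim=0)
        return w[0] * static_scores + w[1] * motion_scores


if __name__ == "__main__":
    # Toy shapes: 8 sampled frames, 256-channel 7x7 feature maps, 101 classes.
    model = TwoStreamFusion(channels=256, num_classes=101)
    static_feats = torch.randn(8, 256, 7, 7)  # RGB-stream features (assumed backbone output)
    motion_feats = torch.randn(8, 256, 7, 7)  # optical-flow-stream features
    print(model(static_feats, motion_feats).shape)  # torch.Size([101])
```

Softmax-normalizing a learnable parameter keeps the two stream weights positive and summing to one, which is one simple way to realize the adaptively weighted fusion the abstract describes.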