Video summarization with temporal-channel visual transformer

Cited: 0
Authors
Tian, Xiaoyan [1 ]
Jin, Ye [1 ]
Zhang, Zhao [2 ]
Liu, Peng [1 ]
Tang, Xianglong [1 ]
Affiliations
[1] Harbin Inst Technol, Fac Comp, Harbin 150001, Peoples R China
[2] Harbin Inst Technol, Sch Instrument Sci & Engn, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Heilongjiang Province;
Keywords
Video summarization; Transformer; Dual-stream embedding; Temporal-channel inter-frame correlation; Intra-segment representation; NETWORK;
DOI
10.1016/j.patcog.2025.111631
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The video summarization task has gained widespread interest owing to its value for efficient video browsing. Existing approaches generally focus on inter-frame temporal correlations, which may be insufficient for identifying crucial content because of the limited useful detail they capture. To address this, we propose a novel transformer-based approach for video summarization, called the Temporal-Channel Visual Transformer (TCVT). The proposed TCVT consists of three components: a dual-stream embedding module, an inter-frame encoder, and an intra-segment encoder. The dual-stream embedding module builds a fused embedding sequence from visual features and short-range optical-flow features, preserving both appearance and motion details. Temporal-channel inter-frame correlations are learned by the inter-frame encoder through multiple temporal and channel attention modules, while intra-segment representations are captured by the intra-segment encoder for local temporal context modeling. Finally, the frame-level and segment-level representations are fused to predict frame-wise importance scores. Our network outperforms state-of-the-art methods on two benchmark datasets, with improvements from 55.3% to 56.9% on the SumMe dataset and from 69.3% to 70.4% on the TVSum dataset.
Pages: 15
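
The abstract outlines the TCVT pipeline: a dual-stream embedding of appearance and motion features, an inter-frame encoder with temporal and channel attention, an intra-segment encoder for local temporal context, and a fusion of frame-level and segment-level representations into frame-wise importance scores. Below is a minimal PyTorch sketch of such a pipeline for illustration only; all module names, feature dimensions, the segment length, the squeeze-and-excitation-style channel attention, and the scoring head are assumptions, since the record does not give the paper's actual implementation details.

```python
# Illustrative sketch of a TCVT-style pipeline (assumed design, not the paper's code).
import torch
import torch.nn as nn


class TemporalChannelBlock(nn.Module):
    """One inter-frame encoder block: temporal self-attention over frames,
    followed by a channel re-weighting gate (squeeze-and-excitation style, assumed)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, dim), nn.Sigmoid()
        )

    def forward(self, x):                      # x: (B, T, dim)
        attn_out, _ = self.temporal_attn(x, x, x)
        x = self.norm(x + attn_out)            # temporal inter-frame correlation
        gate = self.channel_gate(x.mean(dim=1, keepdim=True))
        return x * gate                        # re-weight feature channels


class TCVTSketch(nn.Module):
    def __init__(self, visual_dim=1024, flow_dim=256, dim=512, seg_len=16):
        super().__init__()
        self.seg_len = seg_len
        # Dual-stream embedding: fuse appearance and short-range motion features.
        self.embed = nn.Linear(visual_dim + flow_dim, dim)
        # Inter-frame encoder: stacked temporal-channel attention blocks.
        self.inter_frame = nn.Sequential(*[TemporalChannelBlock(dim) for _ in range(2)])
        # Intra-segment encoder: local temporal context within fixed-length segments.
        self.intra_segment = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=1
        )
        self.score_head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, visual_feats, flow_feats):
        # visual_feats: (B, T, visual_dim); flow_feats: (B, T, flow_dim)
        x = self.embed(torch.cat([visual_feats, flow_feats], dim=-1))      # (B, T, dim)
        frame_repr = self.inter_frame(x)                                   # frame-level
        B, T, D = x.shape
        pad = (-T) % self.seg_len
        xp = nn.functional.pad(x, (0, 0, 0, pad))                          # pad T to full segments
        segs = xp.reshape(B * ((T + pad) // self.seg_len), self.seg_len, D)
        seg_repr = self.intra_segment(segs).reshape(B, T + pad, D)[:, :T]  # segment-level
        fused = torch.cat([frame_repr, seg_repr], dim=-1)
        return self.score_head(fused).squeeze(-1)                          # (B, T) importance scores


if __name__ == "__main__":
    # Random tensors stand in for per-frame CNN features and short-range optical-flow features.
    scores = TCVTSketch()(torch.randn(1, 100, 1024), torch.randn(1, 100, 256))
    print(scores.shape)  # torch.Size([1, 100])
```

In practice, the visual stream would come from a pretrained frame-level backbone and the motion stream from an optical-flow extractor; the predicted importance scores would then drive keyshot selection under a summary-length budget, as is standard for SumMe and TVSum evaluation.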