Video summarization with temporal-channel visual transformer

Cited by: 0
Authors
Tian, Xiaoyan [1]
Jin, Ye [1]
Zhang, Zhao [2]
Liu, Peng [1]
Tang, Xianglong [1]
Affiliations
[1] Harbin Inst Technol, Fac Comp, Harbin 150001, Peoples R China
[2] Harbin Inst Technol, Sch Instrument Sci & Engn, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Heilongjiang Province
Keywords
Video summarization; Transformer; Dual-stream embedding; Temporal-channel inter-frame correlation; Intra-segment representation; NETWORK;
DOI
10.1016/j.patcog.2025.111631
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video summarization has gained widespread interest because it enables efficient video browsing. Existing approaches generally focus on inter-frame temporal correlations, which may be insufficient for identifying crucial content because they capture only limited useful detail. To address this limitation, we propose a novel transformer-based approach for video summarization, called Temporal-Channel Visual Transformer (TCVT). TCVT consists of three components: a dual-stream embedding module, an inter-frame encoder, and an intra-segment encoder. The dual-stream embedding module creates the fused embedding sequence by extracting visual features and short-range optical features, preserving appearance and motion details. The inter-frame encoder learns temporal-channel inter-frame correlations with multiple temporal and channel attention modules. Meanwhile, the intra-segment encoder captures intra-segment representations for local temporal context modeling. Finally, we fuse the frame-level and segment-level representations to predict frame-wise importance scores. Our network outperforms state-of-the-art methods on two benchmark datasets, with improvements from 55.3% to 56.9% on the SumMe dataset and from 69.3% to 70.4% on the TVSum dataset.
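To make the distinction between temporal and channel attention concrete, the following is a minimal sketch only, not the authors' implementation. It assumes the input is an already fused embedding sequence (a list of frames, each a list of channel values), uses identity query/key/value projections instead of learned ones, substitutes a fixed-length segment mean for the intra-segment encoder, and replaces the learned scoring head with a sigmoid of the mean fused feature. All function names and the `seg_len` parameter are hypothetical.

```python
import math


def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    # a is n x k, b is k x m; returns n x m.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x):
    # Scaled dot-product self-attention over the rows of x.
    # Assumption: identity Q/K/V projections (no learned weights).
    scale = math.sqrt(len(x[0]))
    scores = matmul(x, transpose(x))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, x)


def temporal_attention(frames):
    # Rows are frames: every frame attends to every other frame.
    return self_attention(frames)


def channel_attention(frames):
    # Rows are channels: every feature channel attends to every other channel.
    return transpose(self_attention(transpose(frames)))


def segment_means(frames, seg_len):
    # Stand-in for an intra-segment encoder (assumption): average each
    # fixed-length segment and repeat that mean for every frame in it.
    out = []
    for start in range(0, len(frames), seg_len):
        seg = frames[start:start + seg_len]
        mean = [sum(col) / len(seg) for col in zip(*seg)]
        out.extend([mean] * len(seg))
    return out


def importance_scores(frames, seg_len=2):
    # Fuse frame-level (temporal + channel) and segment-level features by
    # summation, then map each frame to a score in (0, 1).
    t = temporal_attention(frames)
    c = channel_attention(frames)
    s = segment_means(frames, seg_len)
    fused = [[a + b + d for a, b, d in zip(rt, rc, rs)]
             for rt, rc, rs in zip(t, c, s)]
    # Assumption: sigmoid of the mean feature replaces a learned scoring head.
    return [1.0 / (1.0 + math.exp(-sum(row) / len(row))) for row in fused]


if __name__ == "__main__":
    toy = [[0.1, 0.4, 0.2], [0.3, 0.1, 0.5], [0.9, 0.8, 0.7], [0.2, 0.2, 0.1]]
    print(importance_scores(toy, seg_len=2))
```

The only difference between the two attention variants in this sketch is which axis supplies the tokens: temporal attention builds a frames-by-frames weight matrix, while channel attention transposes the sequence and builds a channels-by-channels one.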
Pages: 15