Video summarization with temporal-channel visual transformer

Cited by: 0
Authors
Tian, Xiaoyan [1]
Jin, Ye [1]
Zhang, Zhao [2]
Liu, Peng [1]
Tang, Xianglong [1]
Affiliations
[1] Harbin Inst Technol, Fac Comp, Harbin 150001, Peoples R China
[2] Harbin Inst Technol, Sch Instrument Sci & Engn, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Heilongjiang Province
Keywords
Video summarization; Transformer; Dual-stream embedding; Temporal-channel inter-frame correlation; Intra-segment representation; NETWORK;
DOI
10.1016/j.patcog.2025.111631
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The video summarization task has gained widespread interest because it enables efficient video browsing. Existing approaches generally focus on inter-frame temporal correlations, which may not be sufficient to identify crucial content because they capture only limited useful detail. To address this limitation, we propose a novel transformer-based approach for video summarization, called Temporal-Channel Visual Transformer (TCVT). The proposed TCVT consists of three components: a dual-stream embedding module, an inter-frame encoder, and an intra-segment encoder. The dual-stream embedding module creates the fusion embedding sequence by extracting visual features and short-range optical features, preserving appearance and motion details. The inter-frame encoder learns the temporal-channel inter-frame correlations with multiple temporal and channel attention modules. Meanwhile, the intra-segment encoder captures the intra-segment representations for local temporal context modeling. Finally, we fuse the frame-level and segment-level representations to predict frame-wise importance scores. Our network outperforms state-of-the-art methods on two benchmark datasets, with improvements from 55.3% to 56.9% on the SumMe dataset and from 69.3% to 70.4% on the TVSum dataset.
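The distinction the abstract draws between temporal and channel attention can be illustrated with a minimal sketch. This is not the authors' TCVT implementation: it assumes projection-free scaled dot-product attention, a simple average as the fusion step, and a sigmoid of the mean feature as the importance score. All function names (`self_attention`, `temporal_channel_block`, `frame_scores`) and the toy feature values are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x):
    # Assumption: no learned query/key/value projections, so the rows of x
    # attend to each other directly via scaled dot products.
    scale = math.sqrt(len(x[0]))
    scores = matmul(x, transpose(x))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, x)

def temporal_channel_block(frames):
    # frames: T x C matrix (T frames, C feature channels).
    temporal = self_attention(frames)                       # T x T weights over frames
    channel = transpose(self_attention(transpose(frames)))  # C x C weights over channels
    # Assumption: fuse the two views by averaging them element-wise.
    return [[(t + c) / 2 for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(temporal, channel)]

def frame_scores(frames):
    # Hypothetical scoring head: sigmoid of the mean fused feature per frame.
    fused = temporal_channel_block(frames)
    return [1 / (1 + math.exp(-sum(row) / len(row))) for row in fused]

# Toy input: 4 frames with 3 feature channels each (made-up values).
features = [[0.1, 0.4, 0.2], [0.9, 0.8, 0.7], [0.2, 0.1, 0.3], [0.5, 0.6, 0.4]]
scores = frame_scores(features)
```

Temporal attention weights frames against each other, while channel attention weights feature dimensions against each other on the transposed matrix. Both return a T x C result, which is why they can be fused element-wise here.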
Pages: 15