Video summarization with temporal-channel visual transformer

Citations: 0

Authors:

Tian, Xiaoyan [1]

Jin, Ye [1]

Zhang, Zhao [2]

Liu, Peng [1]

Tang, Xianglong [1]

Affiliations:

[1] Harbin Inst Technol, Fac Comp, Harbin 150001, Peoples R China

[2] Harbin Inst Technol, Sch Instrument Sci & Engn, Harbin 150001, Peoples R China

Source:

PATTERN RECOGNITION | 2025 / Vol. 165

Funding:

National Natural Science Foundation of China; Natural Science Foundation of Heilongjiang Province;

Keywords:

Video summarization; Transformer; Dual-stream embedding; Temporal-channel inter-frame correlation; Intra-segment representation; NETWORK;

DOI:

10.1016/j.patcog.2025.111631

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The video summarization task has gained widespread interest because it enables efficient video browsing. Existing approaches generally focus on inter-frame temporal correlations, which may not be sufficient to identify crucial content because they capture only limited useful detail. To address this limitation, we propose a novel transformer-based approach for video summarization, called Temporal-Channel Visual Transformer (TCVT). The proposed TCVT consists of three components: a dual-stream embedding module, an inter-frame encoder, and an intra-segment encoder. The dual-stream embedding module creates the fusion embedding sequence by extracting visual features and short-range optical features, preserving appearance and motion details. The inter-frame encoder learns temporal-channel inter-frame correlations with multiple temporal and channel attention modules. Meanwhile, the intra-segment encoder captures intra-segment representations to model local temporal context. Finally, we fuse the frame-level and segment-level representations to predict frame-wise importance scores. Our network outperforms state-of-the-art methods on two benchmark datasets, with improvements from 55.3% to 56.9% on the SumMe dataset and from 69.3% to 70.4% on the TVSum dataset.
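The data flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give layer definitions, so everything below is an assumption made for illustration, including fusion by concatenation, single-head attention without learned projections, mean pooling over fixed-length segments, additive fusion of the two levels, and the random scoring weights. All function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_attention(x):
    """Self-attention across frames. x is (T, D); the weights are (T, T)."""
    _, d = x.shape
    weights = softmax(x @ x.T / np.sqrt(d), axis=-1)
    return weights @ x


def channel_attention(x):
    """Self-attention across feature channels. The weights are (D, D)."""
    t, _ = x.shape
    weights = softmax(x.T @ x / np.sqrt(t), axis=-1)
    # Channels act as tokens: (weights @ x.T).T equals x @ weights.T.
    return x @ weights.T


def segment_context(x, seg_len):
    """Replace each frame with the mean of its fixed-length segment (assumed pooling)."""
    out = np.empty_like(x)
    for start in range(0, x.shape[0], seg_len):
        out[start:start + seg_len] = x[start:start + seg_len].mean(axis=0)
    return out


def frame_scores(appearance, motion, seg_len=4, seed=0):
    """Return one importance score in [0, 1] per frame."""
    # Dual-stream embedding: concatenating the two streams is an assumption.
    x = np.concatenate([appearance, motion], axis=1)
    # Frame-level branch: residual sum of temporal and channel attention.
    frame_level = x + temporal_attention(x) + channel_attention(x)
    # Segment-level branch: local temporal context inside each segment.
    segment_level = segment_context(x, seg_len)
    fused = frame_level + segment_level
    # Stand-in for a trained scoring head: random weights plus a sigmoid.
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(fused.shape[1]) / np.sqrt(fused.shape[1])
    return 1.0 / (1.0 + np.exp(-(fused @ w)))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    appearance = rng.standard_normal((10, 6))  # 10 frames, 6 appearance dims
    motion = rng.standard_normal((10, 3))      # 10 frames, 3 motion dims
    print(frame_scores(appearance, motion).shape)
```

The point of the sketch is the shape of the two attention maps: temporal attention relates frames to frames (T by T), whereas channel attention relates feature dimensions to each other (D by D), and both return a sequence with the same shape as the input.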


Pages: 15

