Video summarization with temporal-channel visual transformer

Cited by: 0

Authors:

Tian, Xiaoyan [1]

Jin, Ye [1]

Zhang, Zhao [2]

Liu, Peng [1]

Tang, Xianglong [1]

Affiliations:

[1] Harbin Inst Technol, Fac Comp, Harbin 150001, Peoples R China

[2] Harbin Inst Technol, Sch Instrument Sci & Engn, Harbin 150001, Peoples R China

Source:

Pattern Recognition | 2025 / Vol. 165

Funding:

National Natural Science Foundation of China; Natural Science Foundation of Heilongjiang Province

Keywords:

Video summarization; Transformer; Dual-stream embedding; Temporal-channel inter-frame correlation; Intra-segment representation; NETWORK;

DOI:

10.1016/j.patcog.2025.111631

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The video summarization task has gained widespread interest because it enables efficient video browsing. Existing approaches generally focus on inter-frame temporal correlations, which may not be sufficient to identify crucial content because they capture only limited useful detail. To address this issue, we propose a novel transformer-based approach for video summarization, called Temporal-Channel Visual Transformer (TCVT). TCVT consists of three components: a dual-stream embedding module, an inter-frame encoder, and an intra-segment encoder. The dual-stream embedding module creates the fused embedding sequence by extracting visual features and short-range optical features, preserving appearance and motion details. The inter-frame encoder learns temporal-channel inter-frame correlations with multiple temporal and channel attention modules. Meanwhile, the intra-segment encoder captures intra-segment representations for local temporal context modeling. Finally, we fuse the frame-level and segment-level representations to predict frame-wise importance scores. Our network outperforms state-of-the-art methods on two benchmark datasets, with improvements from 55.3% to 56.9% on the SumMe dataset and from 69.3% to 70.4% on the TVSum dataset.
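The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`dual_stream_embed`, `temporal_channel_attention`, `segment_means`, `frame_scores`), the identity projections in place of learned weights, the fixed-length segments, the mean pooling used for the intra-segment step, and the additive fusion are all assumptions made only to show how attention over frames differs from attention over channels on a T x C feature sequence.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(tokens):
    # Scaled dot-product self-attention with identity projections
    # (assumption: a real model would use learned Q/K/V weights).
    d = len(tokens[0])
    scores = matmul(tokens, transpose(tokens))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, tokens)

def dual_stream_embed(appearance, motion):
    # Hypothetical fusion: concatenate per-frame appearance and motion features.
    return [a + m for a, m in zip(appearance, motion)]

def temporal_channel_attention(frames):
    # frames is T x C. Temporal attention treats frames as tokens;
    # channel attention treats channels as tokens (the transposed matrix).
    temporal = self_attention(frames)
    channel = transpose(self_attention(transpose(frames)))
    return [[t + c for t, c in zip(tr, cr)] for tr, cr in zip(temporal, channel)]

def segment_means(frames, seg_len):
    # Stand-in for an intra-segment encoder: each frame receives the
    # mean feature of its fixed-length segment (assumption).
    out = []
    for start in range(0, len(frames), seg_len):
        seg = frames[start:start + seg_len]
        mean = [sum(col) / len(seg) for col in zip(*seg)]
        out.extend([list(mean) for _ in seg])
    return out

def frame_scores(frames, seg_len=2):
    # Fuse frame-level and segment-level features, then squash the
    # per-frame mean to (0, 1) as a toy importance score.
    inter = temporal_channel_attention(frames)
    intra = segment_means(frames, seg_len)
    fused = [[a + b for a, b in zip(x, y)] for x, y in zip(inter, intra)]
    return [1.0 / (1.0 + math.exp(-sum(r) / len(r))) for r in fused]

if __name__ == "__main__":
    appearance = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5], [0.4, 0.4]]  # 4 frames, 2 channels
    motion = [[0.05], [0.2], [0.1], [0.0]]                          # 4 frames, 1 channel
    print(frame_scores(dual_stream_embed(appearance, motion)))
```

Transposing the sequence before attention is the only difference between the two branches here, which is why the sketch reuses one `self_attention` helper for both.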


Pages: 15
