Video summarization with temporal-channel visual transformer

Cited: 0
Authors
Tian, Xiaoyan [1 ]
Jin, Ye [1 ]
Zhang, Zhao [2 ]
Liu, Peng [1 ]
Tang, Xianglong [1 ]
Affiliations
[1] Harbin Inst Technol, Fac Comp, Harbin 150001, Peoples R China
[2] Harbin Inst Technol, Sch Instrument Sci & Engn, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Heilongjiang Province;
Keywords
Video summarization; Transformer; Dual-stream embedding; Temporal-channel inter-frame correlation; Intra-segment representation; NETWORK;
DOI
10.1016/j.patcog.2025.111631
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The video summarization task has gained widespread interest owing to its value for efficient video browsing. Existing approaches generally focus on inter-frame temporal correlations, which may be insufficient for identifying crucial content because of the limited useful detail they capture. To address this, we propose a novel transformer-based approach for video summarization, called the Temporal-Channel Visual Transformer (TCVT). The proposed TCVT consists of three components: a dual-stream embedding module, an inter-frame encoder, and an intra-segment encoder. The dual-stream embedding module builds a fused embedding sequence from visual features and short-range optical-flow features, preserving both appearance and motion details. Temporal-channel inter-frame correlations are learned by the inter-frame encoder through multiple temporal and channel attention modules, while intra-segment representations are captured by the intra-segment encoder for local temporal context modeling. Finally, the frame-level and segment-level representations are fused to predict frame-wise importance scores. Our network outperforms state-of-the-art methods on two benchmark datasets, with improvements from 55.3% to 56.9% on the SumMe dataset and from 69.3% to 70.4% on the TVSum dataset.
Pages: 15
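
The abstract outlines the TCVT pipeline: a dual-stream embedding of appearance and motion features, an inter-frame encoder with temporal and channel attention, an intra-segment encoder for local temporal context, and a fusion of frame-level and segment-level representations into frame-wise importance scores. Below is a minimal PyTorch sketch of such a pipeline for illustration only; all module names, feature dimensions, the segment length, the squeeze-and-excitation-style channel attention, and the scoring head are assumptions, since the record does not give the paper's actual implementation details.

```python
# Illustrative sketch of a TCVT-style pipeline (assumed design, not the paper's code).
import torch
import torch.nn as nn


class TemporalChannelBlock(nn.Module):
    """One inter-frame encoder block: temporal self-attention over frames,
    followed by a channel re-weighting gate (squeeze-and-excitation style, assumed)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, dim), nn.Sigmoid()
        )

    def forward(self, x):                      # x: (B, T, dim)
        attn_out, _ = self.temporal_attn(x, x, x)
        x = self.norm(x + attn_out)            # temporal inter-frame correlation
        gate = self.channel_gate(x.mean(dim=1, keepdim=True))
        return x * gate                        # re-weight feature channels


class TCVTSketch(nn.Module):
    def __init__(self, visual_dim=1024, flow_dim=256, dim=512, seg_len=16):
        super().__init__()
        self.seg_len = seg_len
        # Dual-stream embedding: fuse appearance and short-range motion features.
        self.embed = nn.Linear(visual_dim + flow_dim, dim)
        # Inter-frame encoder: stacked temporal-channel attention blocks.
        self.inter_frame = nn.Sequential(*[TemporalChannelBlock(dim) for _ in range(2)])
        # Intra-segment encoder: local temporal context within fixed-length segments.
        self.intra_segment = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=1
        )
        self.score_head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, visual_feats, flow_feats):
        # visual_feats: (B, T, visual_dim); flow_feats: (B, T, flow_dim)
        x = self.embed(torch.cat([visual_feats, flow_feats], dim=-1))      # (B, T, dim)
        frame_repr = self.inter_frame(x)                                   # frame-level
        B, T, D = x.shape
        pad = (-T) % self.seg_len
        xp = nn.functional.pad(x, (0, 0, 0, pad))                          # pad T to full segments
        segs = xp.reshape(B * ((T + pad) // self.seg_len), self.seg_len, D)
        seg_repr = self.intra_segment(segs).reshape(B, T + pad, D)[:, :T]  # segment-level
        fused = torch.cat([frame_repr, seg_repr], dim=-1)
        return self.score_head(fused).squeeze(-1)                          # (B, T) importance scores


if __name__ == "__main__":
    # Random tensors stand in for per-frame CNN features and short-range optical-flow features.
    scores = TCVTSketch()(torch.randn(1, 100, 1024), torch.randn(1, 100, 256))
    print(scores.shape)  # torch.Size([1, 100])
```

In practice, the visual stream would come from a pretrained frame-level backbone and the motion stream from an optical-flow extractor; the predicted importance scores would then drive keyshot selection under a summary-length budget, as is standard for SumMe and TVSum evaluation.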