Video summarization with temporal-channel visual transformer

Cited by: 0

Authors:

Tian, Xiaoyan [1]

Jin, Ye [1]

Zhang, Zhao [2]

Liu, Peng [1]

Tang, Xianglong [1]

Affiliations:

[1] Harbin Inst Technol, Fac Comp, Harbin 150001, Peoples R China

[2] Harbin Inst Technol, Sch Instrument Sci & Engn, Harbin 150001, Peoples R China

Source:

PATTERN RECOGNITION | 2025 / Vol. 165

Funding:

National Natural Science Foundation of China; Natural Science Foundation of Heilongjiang Province;

Keywords:

Video summarization; Transformer; Dual-stream embedding; Temporal-channel inter-frame correlation; Intra-segment representation; NETWORK;

DOI:

10.1016/j.patcog.2025.111631

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The video summarization task has gained widespread interest because it enables efficient video browsing. Existing approaches generally focus on inter-frame temporal correlations, which may not be sufficient to identify crucial content because they capture only limited useful detail. To address this limitation, we propose a novel transformer-based approach for video summarization, called Temporal-Channel Visual Transformer (TCVT). TCVT consists of three components: a dual-stream embedding module, an inter-frame encoder, and an intra-segment encoder. The dual-stream embedding module builds a fused embedding sequence from visual features and short-range optical features, preserving both appearance and motion details. The inter-frame encoder learns temporal-channel inter-frame correlations with multiple temporal and channel attention modules. Meanwhile, the intra-segment encoder captures intra-segment representations to model local temporal context. Finally, we fuse the frame-level and segment-level representations to predict frame-wise importance scores. Our network outperforms state-of-the-art methods on two benchmark datasets, with improvements from 55.3% to 56.9% on the SumMe dataset and from 69.3% to 70.4% on the TVSum dataset.
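To make the distinction between temporal and channel attention concrete, here is a minimal NumPy sketch of the data flow the abstract describes. It is not the authors' implementation: the function names, the use of concatenation for dual-stream fusion, the unparameterized attention, the mean-pooled fixed-length segments, and the untrained random projection are all assumptions chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(x):
    # Attention across frames: x is (T, D), the weight matrix is (T, T).
    scores = x @ x.T / np.sqrt(x.shape[1])
    return softmax(scores, axis=-1) @ x

def channel_attention(x):
    # Attention across feature channels: the weight matrix is (D, D).
    scores = x.T @ x / np.sqrt(x.shape[0])
    return x @ softmax(scores, axis=-1).T

def segment_context(x, seg_len):
    # Assumed stand-in for an intra-segment encoder: mean-pool each
    # fixed-length segment and broadcast the result back to its frames.
    out = np.empty_like(x)
    for start in range(0, len(x), seg_len):
        out[start:start + seg_len] = x[start:start + seg_len].mean(axis=0)
    return out

def frame_scores(appearance, motion, seg_len=4, seed=0):
    # Dual-stream fusion (assumed here to be simple concatenation).
    x = np.concatenate([appearance, motion], axis=1)
    # Frame-level representation from temporal and channel attention.
    frame_level = x + temporal_attention(x) + channel_attention(x)
    # Segment-level representation for local temporal context.
    segment_level = segment_context(x, seg_len)
    fused = frame_level + segment_level
    # Untrained random projection standing in for a learned scoring head.
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(fused.shape[1]) / np.sqrt(fused.shape[1])
    return 1.0 / (1.0 + np.exp(-(fused @ w)))

T, D = 12, 8
rng = np.random.default_rng(1)
scores = frame_scores(rng.standard_normal((T, D)), rng.standard_normal((T, D)))
print(scores.shape)  # (12,)
```

The point of the sketch is the shape of the two attention maps: temporal attention relates frames to frames (T by T), while channel attention relates feature dimensions to each other (D by D), and both outputs keep the original (T, D) layout so they can be fused with the segment-level context.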


Pages: 15

