Video summarization with temporal-channel visual transformer

被引:0
|
作者
Tian, Xiaoyan [1 ]
Jin, Ye [1 ]
Zhang, Zhao [2 ]
Liu, Peng [1 ]
Tang, Xianglong [1 ]
机构
[1] Harbin Inst Technol, Fac Comp, Harbin 150001, Peoples R China
[2] Harbin Inst Technol, Sch Instrument Sci & Engn, Harbin 150001, Peoples R China
基金
中国国家自然科学基金; 黑龙江省自然科学基金;
关键词
Video summarization; Transformer; Dual-stream embedding; Temporal-channel inter-frame correlation; Intra-segment representation; NETWORK;
D O I
10.1016/j.patcog.2025.111631
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Video summarization task has gained widespread interest, benefiting from its valuable capabilities for efficient video browsing. Existing approaches generally focus on inter-frame temporal correlations, which may not be sufficient to identify crucial content because of the limited useful details that can be gleaned. To resolve these issues, we propose a novel transformer-based approach for video summarization, called Temporal-Channel Visual Transformer (TCVT). The proposed TCVT consists of three components, including a dual-stream embedding module, an inter-frame encoder, and an intra-segment encoder. The dual-stream embedding module creates the fusion embedding sequence by extracting visual features and short-range optical features, preserving appearance and motion details. The temporal-channel inter-frame correlations are learned by the inter-frame encoder with multiple temporal and channel attention modules. Meanwhile, the intra-segment representations are captured by the intra-segment encoder for the local temporal context modeling. Finally, we fuse the frame-level and segment-level representations for the frame-wise importance score prediction. Our network outperforms state-of-the-art methods on two benchmark datasets, with improvements from 55.3% to 56.9% on the SumMe dataset and from 69.3% to 70.4% on the TVSum dataset.
引用
收藏
页数:15
相关论文
共 50 条
  • [1] Temporal-channel cascaded transformer for imagined handwriting character recognition
    Zhou, Wenhui
    Wang, Yuhan
    Mo, Liangyan
    Li, Changsheng
    Xu, Mingyue
    Kong, Wanzeng
    Dai, Guojun
    NEUROCOMPUTING, 2024, 573
  • [2] Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection for Autonomous Driving
    Yuan, Zhenxun
    Song, Xiao
    Bai, Lei
    Wang, Zhe
    Ouyang, Wanli
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (04) : 2068 - 2078
  • [3] Efficient Transformer for Video Summarization
    Kolmakova, Tatiana
    Makarov, Ilya
    ADVANCES IN COMPUTATIONAL INTELLIGENCE, IWANN 2023, PT II, 2023, 14135 : 52 - 65
  • [4] Efficient filtering and clustering methods for temporal video segmentation and visual summarization
    Ferman, AM
    Tekalp, AM
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 1998, 9 (04) : 336 - 351
  • [5] Video Summarization With Spatiotemporal Vision Transformer
    Hsu, Tzu-Chun
    Liao, Yi-Sheng
    Huang, Chun-Rong
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 3013 - 3026
  • [6] TCAMS-Trans: Efficient temporal-channel attention multi-scale transformer for net load forecasting
    Zhang, Qingyong
    Zhou, Shiyang
    Xu, Bingrong
    Li, Xinran
    COMPUTERS & ELECTRICAL ENGINEERING, 2024, 118
  • [7] Video summarization with u-shaped transformer
    Chen, Yaosen
    Guo, Bing
    Shen, Yan
    Zhou, Renshuang
    Lu, Weichen
    Wang, Wei
    Wen, Xuming
    Suo, Xinhua
    APPLIED INTELLIGENCE, 2022, 52 (15) : 17864 - 17880
  • [8] Video Summarization With Frame Index Vision Transformer
    Hsu, Tzu-Chun
    Liao, Yi-Sheng
    Huang, Chun-Rong
    PROCEEDINGS OF 17TH INTERNATIONAL CONFERENCE ON MACHINE VISION APPLICATIONS (MVA 2021), 2021,
  • [9] Video summarization with u-shaped transformer
    Yaosen Chen
    Bing Guo
    Yan Shen
    Renshuang Zhou
    Weichen Lu
    Wei Wang
    Xuming Wen
    Xinhua Suo
    Applied Intelligence, 2022, 52 : 17864 - 17880
  • [10] Video Co-summarization: Video Summarization by Visual Co-occurrence
    Chu, Wen-Sheng
    Song, Yale
    Jaimes, Alejandro
    2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2015, : 3584 - 3592