Combining Global and Local Attention with Positional Encoding for Video Summarization

Cited by: 42
Authors
Apostolidis, Evlampios [1, 2]
Balaouras, Georgios [1]
Mezaris, Vasileios [1]
Patras, Ioannis [3]
Affiliations
[1] CERTH ITI, Thessaloniki 57001, Greece
[2] Queen Mary Univ London, London E1 4NS, England
[3] Queen Mary Univ London, London E1 4NS, England
Source
23RD IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM 2021) | 2021
Funding
EU Horizon 2020; UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
video summarization; self-attention; multi-head attention; positional encoding; supervised learning;
DOI
10.1109/ISM52913.2021.00045
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
This paper presents a new method for supervised video summarization. To overcome the drawbacks of existing RNN-based summarization architectures, which relate to the modeling of long-range frame dependencies and the ability to parallelize the training process, the developed model relies on self-attention mechanisms to estimate the importance of video frames. Contrary to previous attention-based summarization approaches that model frame dependencies by observing the entire frame sequence, our method combines global and local multi-head attention mechanisms to discover different modelings of the frame dependencies at different levels of granularity. Moreover, the utilized attention mechanisms integrate a component that encodes the temporal position of video frames, which is of major importance when producing a video summary. Experiments on two datasets (SumMe and TVSum) demonstrate the effectiveness of the proposed model compared to existing attention-based methods, and its competitiveness against other state-of-the-art supervised summarization approaches. An ablation study that focuses on our main proposed components, namely the use of global and local multi-head attention mechanisms in combination with an absolute positional encoding component, shows their relative contributions to the overall summarization performance.
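The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the encoding formula, the segmentation scheme, or how the global and local outputs are fused. The sketch assumes a sinusoidal absolute positional encoding, a single attention head without learned projections (the paper uses multi-head attention), equal-length non-overlapping segments for the local branch, and fusion by addition. All function names are hypothetical.

```python
import math

def positional_encoding(num_frames, dim):
    """Sinusoidal absolute positional encoding (assumed; not stated in the abstract)."""
    pe = [[0.0] * dim for _ in range(num_frames)]
    for pos in range(num_frames):
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Scaled dot-product self-attention, single head, no learned projections."""
    dim = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, frames)) for j in range(dim)])
    return out

def global_local_attention(features, num_segments):
    """Add positions, attend over the whole sequence and within segments, then sum."""
    n, dim = len(features), len(features[0])
    pe = positional_encoding(n, dim)
    x = [[f + p for f, p in zip(fr, pr)] for fr, pr in zip(features, pe)]
    # Global branch: every frame attends to every other frame.
    global_out = self_attention(x)
    # Local branch: frames attend only within their own segment.
    seg_len = math.ceil(n / num_segments)
    local_out = []
    for start in range(0, n, seg_len):
        local_out.extend(self_attention(x[start:start + seg_len]))
    # Fusion by addition (an assumption made for this sketch).
    return [[g + l for g, l in zip(gr, lr)] for gr, lr in zip(global_out, local_out)]

# Toy input: 6 frames with 4-dimensional features, split into 2 local segments.
features = [[0.1 * (t + j) for j in range(4)] for t in range(6)]
fused = global_local_attention(features, num_segments=2)
print(len(fused), len(fused[0]))  # 6 4
```

In a full model, the fused representation would feed a regressor that outputs one importance score per frame; that stage is omitted here.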
Pages: 226-234
Page count: 9