Combining Global and Local Attention with Positional Encoding for Video Summarization

Cited by: 42

Authors:

Apostolidis, Evlampios [1,2]

Balaouras, Georgios [1]

Mezaris, Vasileios [1]

Patras, Ioannis [3]

Affiliations:

[1] CERTH ITI, Thessaloniki 57001, Greece

[2] Queen Mary Univ London, Thessaloniki 57001, Greece

[3] Queen Mary Univ London, London E1 4NS, England

Source:

23RD IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM 2021) | 2021

Funding:

EU Horizon 2020; UK Engineering and Physical Sciences Research Council (EPSRC);

Keywords:

video summarization; self-attention; multi-head attention; positional encoding; supervised learning;

DOI:

10.1109/ISM52913.2021.00045

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

This paper presents a new method for supervised video summarization. To overcome drawbacks of existing RNN-based summarization architectures that relate to the modeling of long-range frame dependencies and the ability to parallelize the training process, the developed model relies on self-attention mechanisms to estimate the importance of video frames. Contrary to previous attention-based summarization approaches that model the frames' dependencies by observing the entire frame sequence, our method combines global and local multi-head attention mechanisms to model the frames' dependencies at different levels of granularity. Moreover, the utilized attention mechanisms integrate a component that encodes the temporal position of video frames, which is of major importance when producing a video summary. Experiments on two datasets (SumMe and TVSum) demonstrate the effectiveness of the proposed model compared to existing attention-based methods, and its competitiveness against other state-of-the-art supervised summarization approaches. An ablation study that focuses on our main proposed components, namely the use of global and local multi-head attention mechanisms in collaboration with an absolute positional encoding component, shows their relative contributions to the overall summarization performance.
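The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the positional encoding uses the common sinusoidal form (the abstract says only "absolute"), attention is single-head with identity projections instead of learned multi-head weights, the local view splits the video into equal-length segments, and the global and local outputs are fused by addition. The function names and the `num_segments` parameter are hypothetical.

```python
import math

def positional_encoding(num_frames, dim):
    """Sinusoidal absolute positional encoding (assumed form, not confirmed by the abstract)."""
    pe = [[0.0] * dim for _ in range(num_frames)]
    for pos in range(num_frames):
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            pe[pos][i] = math.sin(angle) if i % 2 == 0 else math.cos(angle)
    return pe

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Single-head scaled dot-product self-attention with identity projections
    (an illustrative simplification; no learned weights)."""
    dim = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, frames)) for j in range(dim)])
    return out

def global_local_attention(frames, num_segments):
    """Attend over the whole sequence (global) and within segments (local),
    then fuse the two by addition (the fusion rule is an assumption)."""
    n, dim = len(frames), len(frames[0])
    pe = positional_encoding(n, dim)
    # Inject each frame's temporal position into its feature vector.
    x = [[f + p for f, p in zip(frame, enc)] for frame, enc in zip(frames, pe)]
    global_out = self_attention(x)
    seg_len = math.ceil(n / num_segments)
    local_out = []
    for start in range(0, n, seg_len):
        # Each frame only sees the frames of its own segment.
        local_out.extend(self_attention(x[start:start + seg_len]))
    return [[g + l for g, l in zip(g_row, l_row)]
            for g_row, l_row in zip(global_out, local_out)]

# Toy input: 8 frames with 4-dimensional features, split into 2 segments.
toy_frames = [[0.1 * (t + j) for j in range(4)] for t in range(8)]
fused = global_local_attention(toy_frames, num_segments=2)
```

In a full model, the fused representation would typically feed a small regressor that outputs one importance score per frame; that stage is omitted here.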


Pages: 226-234

Page count: 9

