Relational Reasoning Over Spatial-Temporal Graphs for Video Summarization

Cited by: 45

Authors:

Zhu, Wencheng [1,2]

Han, Yucheng [1,2]

Lu, Jiwen [1,2]

Zhou, Jie [1,2]

Affiliations:

[1] Tsinghua Univ, Beijing Natl Res Ctr Informat Sci & Technol BNRis, Beijing 100084, Peoples R China

[2] Tsinghua Univ, Dept Automat, Beijing 100084, Peoples R China

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2022 / Vol. 31

Funding:

National Natural Science Foundation of China;

Keywords:

Cognition; Proposals; Visualization; Feature extraction; Video sequences; Adversarial machine learning; Image edge detection; Video summarization; spatial-temporal representation; self-attention; graph pooling; relational reasoning;

DOI:

10.1109/TIP.2022.3163855

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper, we propose a dynamic graph modeling approach to learn spatial-temporal representations for video summarization. Most existing video summarization methods extract image-level features with ImageNet pre-trained deep models. In contrast, our method exploits object-level and relation-level information to capture spatial-temporal dependencies. Specifically, our method builds spatial graphs on the detected object proposals. Then, we construct a temporal graph from the aggregated representations of the spatial graphs. Afterward, we perform relational reasoning over the spatial and temporal graphs with graph convolutional networks and extract spatial-temporal representations for importance score prediction and key shot selection. To eliminate the relational clutter caused by densely connected nodes, we further design a self-attention edge pooling module, which discards meaningless relations in the graphs. We conduct extensive experiments on two popular benchmarks, the SumMe and TVSum datasets. Experimental results demonstrate that the proposed method outperforms state-of-the-art video summarization methods.
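The pipeline outlined in the abstract (spatial graphs over object proposals, a temporal graph over their aggregated representations, graph convolution, edge pruning, and per-frame importance scores) can be illustrated with a minimal pure-Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the scaled dot-product attention used to build edges, the top-k pruning used as a stand-in for the self-attention edge pooling module, the mean pooling used for aggregation, and the random weights and features. Key shot selection is not covered.

```python
import math
import random


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_adjacency(nodes):
    """Edge weights as a row-wise softmax over scaled dot products (assumed form)."""
    dim = len(nodes[0])
    adj = []
    for query in nodes:
        scores = [sum(x * y for x, y in zip(query, key)) / math.sqrt(dim) for key in nodes]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        adj.append([e / total for e in exps])
    return adj


def prune_edges(adj, keep):
    """Stand-in for edge pooling: keep the strongest edges per node, renormalize.

    Ties at the threshold are all kept, so a row may retain more than `keep` edges.
    """
    pruned = []
    for row in adj:
        threshold = sorted(row, reverse=True)[keep - 1]
        kept = [w if w >= threshold else 0.0 for w in row]
        total = sum(kept)
        pruned.append([w / total for w in kept])
    return pruned


def gcn_layer(adj, nodes, weight):
    """One graph-convolution step: ReLU(A X W)."""
    out = matmul(matmul(adj, nodes), weight)
    return [[max(0.0, v) for v in row] for row in out]


def mean_pool(nodes):
    """Aggregate a spatial graph into a single frame-level vector."""
    return [sum(col) / len(nodes) for col in zip(*nodes)]


def score_frames(video, w_spatial, w_temporal, w_out, keep=2):
    """Return one importance score in (0, 1) per frame.

    `video` is a list of frames; each frame is a list of object feature vectors.
    """
    frame_nodes = []
    for objects in video:
        adj = prune_edges(attention_adjacency(objects), keep)
        frame_nodes.append(mean_pool(gcn_layer(adj, objects, w_spatial)))
    adj_t = prune_edges(attention_adjacency(frame_nodes), keep)
    hidden = gcn_layer(adj_t, frame_nodes, w_temporal)
    return [1.0 / (1.0 + math.exp(-sum(h * w for h, w in zip(row, w_out)))) for row in hidden]


# Toy data: 5 frames, 3 object proposals per frame, 4-dimensional features.
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


video = [rand_matrix(3, 4) for _ in range(5)]
scores = score_frames(
    video,
    w_spatial=rand_matrix(4, 4),
    w_temporal=rand_matrix(4, 4),
    w_out=[rng.uniform(-1.0, 1.0) for _ in range(4)],
)
print(scores)
```

In this sketch the same attention-and-prune step is reused for both graph levels, which keeps the example short; a real system would learn the weights and would likely use separate parameters for the spatial and temporal graphs.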


Pages: 3017-3031

Page count: 15
