Video Saliency Prediction Based on Spatial-Temporal Two-Stream Network

Times Cited: 61
Authors
Zhang, Kao [1 ]
Chen, Zhenzhong [1 ]
Affiliations
[1] Wuhan Univ, Sch Remote Sensing & Informat Engn, Wuhan 430079, Hubei, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
Feature extraction; Predictive models; Streaming media; Visualization; Spatiotemporal phenomena; Computational modeling; Video saliency; spatial-temporal features; visual attention; deep learning; SPATIOTEMPORAL SALIENCY; COMPRESSED-DOMAIN; VISUAL-ATTENTION; MODEL; GAZE;
DOI
10.1109/TCSVT.2018.2883305
CLC Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
In this paper, we propose a novel two-stream neural network for video saliency prediction. Unlike traditional methods based on hand-crafted feature extraction and integration, the proposed method automatically learns saliency-related spatiotemporal features from human fixations without any pre-processing, post-processing, or manual tuning. Video frames are routed through the spatial stream network, which computes a static (or color) saliency map for each frame. A new two-stage temporal stream network is proposed for the temporal (dynamic) saliency maps; it is composed of a pre-trained 2D-CNN model (SF-Net) that extracts saliency-related features and a shallow 3D-CNN model (Te-Net) that processes these features. This design reduces the amount of video gaze data required, improves training efficiency, and achieves high performance. A fusion network combines the outputs of both streams to generate the final saliency maps. In addition, a convolutional Gaussian priors (CGP) layer is proposed to learn the bias phenomenon in viewing behavior and further improve prediction performance. The proposed method is compared with state-of-the-art saliency models on two public video saliency benchmark datasets. The results demonstrate that our model achieves advanced performance on video saliency prediction.
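The abstract describes the pipeline only at a high level, so the following is a minimal PyTorch sketch of how such a two-stream design could be wired together. Everything here (layer sizes, channel counts, clip length, and the stand-in modules for SF-Net, Te-Net, the fusion network, and the CGP layer) is an illustrative assumption based on the abstract, not the authors' implementation.

```python
# Hedged sketch of the two-stream architecture described in the abstract.
# All module internals and shapes are assumptions; SF-Net, Te-Net, the
# fusion network, and the CGP layer below are simplified stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialStream(nn.Module):
    """Per-frame 2D-CNN producing a static saliency map (stand-in)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, 1, 1)

    def forward(self, frame):                      # (B, 3, H, W)
        return self.head(self.body(frame))         # (B, 1, H, W)


class TemporalStream(nn.Module):
    """Two-stage temporal stream: a 2D feature extractor applied per frame
    (the SF-Net role, pre-trained in the paper) followed by a shallow
    3D-CNN over the clip (the Te-Net role)."""
    def __init__(self):
        super().__init__()
        self.sf_net = nn.Sequential(               # SF-Net stand-in
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        )
        self.te_net = nn.Sequential(               # shallow Te-Net stand-in
            nn.Conv3d(64, 32, (3, 3, 3), padding=1), nn.ReLU(),
            nn.Conv3d(32, 1, (3, 1, 1)),           # shrink the time axis
        )

    def forward(self, clip):                       # (B, T, 3, H, W), T >= 3
        b, t, c, h, w = clip.shape
        feats = self.sf_net(clip.reshape(b * t, c, h, w))
        feats = feats.reshape(b, t, -1, h, w).permute(0, 2, 1, 3, 4)
        return self.te_net(feats).mean(dim=2)      # (B, 1, H, W)


class CGPLayer(nn.Module):
    """One reading of the convolutional Gaussian priors idea: a bank of
    learnable 2D Gaussian maps, mixed by a 1x1 convolution and added to
    the fused saliency logits to model viewing bias (an assumption, not
    the paper's exact formulation)."""
    def __init__(self, n_priors=8, grid_size=64):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_priors, 2))         # centers
        self.log_sigma = nn.Parameter(torch.zeros(n_priors, 2))  # widths
        self.mix = nn.Conv2d(n_priors, 1, 1)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, grid_size),
                                torch.linspace(-1, 1, grid_size),
                                indexing="ij")
        self.register_buffer("grid", torch.stack([xs, ys]))      # (2, S, S)

    def forward(self, sal):                        # (B, 1, H, W)
        mu = self.mu.view(-1, 2, 1, 1)
        sigma = self.log_sigma.exp().view(-1, 2, 1, 1)
        z = (self.grid.unsqueeze(0) - mu) / sigma                # (N, 2, S, S)
        priors = torch.exp(-0.5 * (z ** 2).sum(dim=1))           # (N, S, S)
        priors = F.interpolate(priors.unsqueeze(0), size=sal.shape[-2:],
                               mode="bilinear", align_corners=False)
        return sal + self.mix(priors)              # prior broadcasts over B


class TwoStreamSaliency(nn.Module):
    """Spatial + temporal streams, fused, then the Gaussian-prior layer."""
    def __init__(self):
        super().__init__()
        self.spatial = SpatialStream()
        self.temporal = TemporalStream()
        self.fusion = nn.Conv2d(2, 1, 1)           # fusion network stand-in
        self.cgp = CGPLayer()

    def forward(self, clip):                       # (B, T, 3, H, W)
        s = self.spatial(clip[:, -1])              # saliency of current frame
        t = self.temporal(clip)
        fused = self.fusion(torch.cat([s, t], dim=1))
        return torch.sigmoid(self.cgp(fused))      # (B, 1, H, W) in [0, 1]


# Example: a batch of two 5-frame clips at 64x64 resolution.
model = TwoStreamSaliency()
maps = model(torch.randn(2, 5, 3, 64, 64))         # -> torch.Size([2, 1, 64, 64])
```

A design note on the split: the abstract credits the two-stage temporal stream (a pre-trained 2D feature extractor with only a shallow 3D network trained on top) for reducing the amount of gaze data needed and improving training efficiency; the sketch mirrors that division of labor without reproducing the actual networks.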
Pages: 3544-3557
Page count: 14