Shifting More Attention to Video Salient Object Detection

Cited by: 426
Authors
Fan, Deng-Ping [1 ]
Wang, Wenguan [2 ]
Cheng, Ming-Ming [1 ]
Shen, Jianbing [2 ,3 ]
Affiliations
[1] Nankai Univ, CS, TKLNDST, Tianjin, Peoples R China
[2] Incept Inst Artificial Intelligence, Abu Dhabi, U Arab Emirates
[3] Beijing Inst Technol, Beijing, Peoples R China
Source
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019) | 2019
Keywords
DETECTION MODEL; SEGMENTATION; OPTIMIZATION; DRIVEN; SCENE
DOI
10.1109/CVPR.2019.00875
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
The last decade has witnessed a growing interest in video salient object detection (VSOD). However, the research community has long lacked a well-established VSOD dataset representative of real dynamic scenes with high-quality annotations. To address this issue, we elaborately collected a visual-attention-consistent Densely Annotated VSOD (DAVSOD) dataset, which contains 226 videos with 23,938 frames covering diverse realistic scenes, objects, instances, and motions. With corresponding real human eye-fixation data, we obtain precise ground truths. This is the first work that explicitly emphasizes the challenge of saliency shift, i.e., that the video salient object(s) may change dynamically. To further contribute a complete benchmark to the community, we systematically assess 17 representative VSOD algorithms over seven existing VSOD datasets and our DAVSOD, with ~84K frames in total (the largest scale to date). Using three well-known metrics, we then present a comprehensive and insightful performance analysis. Furthermore, we propose a baseline model equipped with a saliency-shift-aware convLSTM, which efficiently captures video saliency dynamics by learning human attention-shift behavior. Extensive experiments open up promising future directions for model development and comparison.
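The baseline's saliency-shift-aware convLSTM builds on the standard ConvLSTM recurrence, in which the LSTM's gate activations are computed by convolutions over the input frame features and the hidden state rather than by dense matrix products. The sketch below is a minimal single-channel NumPy illustration of that recurrence only; the naive `conv2d_same` helper and the kernel names are assumptions for demonstration, and the paper's actual module additionally incorporates the saliency-shift-aware attention mechanism:

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2D filtering (cross-correlation, as in DL convention)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, kernels):
    """One ConvLSTM step: input/forget/output gates and candidate state,
    each computed from convolutions over the input x and hidden state h."""
    i = sigmoid(conv2d_same(x, kernels['wxi']) + conv2d_same(h, kernels['whi']))
    f = sigmoid(conv2d_same(x, kernels['wxf']) + conv2d_same(h, kernels['whf']))
    o = sigmoid(conv2d_same(x, kernels['wxo']) + conv2d_same(h, kernels['who']))
    g = np.tanh(conv2d_same(x, kernels['wxg']) + conv2d_same(h, kernels['whg']))
    c_next = f * c + i * g          # cell state: gated memory update
    h_next = o * np.tanh(c_next)    # hidden state: gated, squashed cell state
    return h_next, c_next
```

Because the gates are spatial maps rather than vectors, the recurrence preserves the spatial layout of saliency across frames, which is what allows a shift-aware variant to track where the salient object moves over time.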
Pages: 8546-8556
Page count: 11