Exploring Rich and Efficient Spatial Temporal Interactions for Real-Time Video Salient Object Detection

Cited by: 88
Authors
Chen, Chenglizhao [1]
Wang, Guotao [1]
Peng, Chong [1]
Fang, Yuming [2]
Zhang, Dingwen [3]
Qin, Hong [4]
Affiliations
[1] Qingdao Univ, Coll Comp Sci & Technol, Qingdao 266000, Shandong, Peoples R China
[2] Jiangxi Univ Finance & Econ, Nanchang 330013, Jiangxi, Peoples R China
[3] Xidian Univ, Xian 710129, Peoples R China
[4] SUNY Stony Brook, Stony Brook, NY 11794 USA
Funding
National Natural Science Foundation of China; U.S. National Science Foundation
Keywords
Three-dimensional displays; Spatiotemporal phenomena; Convolution; Optical sensors; Decoding; Optical network units; Optical imaging; Video salient object detection; lightweight temporal unit; fast temporal shuffle; multiscale spatiotemporal deep features; OPTIMIZATION; IMAGE;
DOI
10.1109/TIP.2021.3068644
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We have witnessed a growing interest in video salient object detection (VSOD) techniques in today's computer vision applications. In contrast with temporal information (which is still considered a rather unstable source thus far), the spatial information is more stable and ubiquitous, thus it could influence our vision system more. As a result, the current main-stream VSOD approaches have inferred and obtained their saliency primarily from the spatial perspective, still treating temporal information as subordinate. Although the aforementioned methodology of focusing on the spatial aspect is effective in achieving a numeric performance gain, it still has two critical limitations. First, to ensure the dominance by the spatial information, its temporal counterpart remains inadequately used, though in some complex video scenes, the temporal information may represent the only reliable data source, which is critical to derive the correct VSOD. Second, both spatial and temporal saliency cues are often computed independently in advance and then integrated later on, while the interactions between them are omitted completely, resulting in saliency cues with limited quality. To combat these challenges, this paper advocates a novel spatiotemporal network, where the key innovation is the design of its temporal unit. Compared with other existing competitors (e.g., convLSTM), the proposed temporal unit exhibits an extremely lightweight design that does not degrade its strong ability to sense temporal information. Furthermore, it fully enables the computation of temporal saliency cues that interact with their spatial counterparts, ultimately boosting the overall VSOD performance and realizing its full potential towards mutual performance improvement for each. The proposed method is easy to implement yet still effective, achieving high-quality VSOD at 50 FPS in real-time applications.
Pages: 3995-4007
Page count: 13