See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks

Cited by: 485

Authors:

Lu, Xiankai [1]

Wang, Wenguan [1]

Ma, Chao [2]

Shen, Jianbing [1]

Shao, Ling [1]

Porikli, Fatih [3]

Affiliations:

[1] Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates

[2] Shanghai Jiao Tong University, AI Institute, MoE Key Laboratory of Artificial Intelligence, Shanghai, People's Republic of China

[3] Australian National University, Canberra, ACT, Australia

Source:

2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019) | 2019

Funding:

Australian Research Council

DOI:

10.1109/CVPR.2019.00374

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to improve further the state-of-the-art deep learning based solutions that primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments. The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space. We train COSNet with pairs of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to infer the frequently reappearing and salient foreground objects better. We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments over three large benchmarks manifest that COSNet outperforms the current alternatives by a large margin.
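
The abstract describes co-attention between a pair of frames only in words. The following is a minimal NumPy sketch of one common way such a layer can be written, not the authors' implementation. It assumes an affinity matrix of the form A^T W B between two flattened feature maps, softmax normalization along each axis, and concatenation as the "appending" step. The function names, the weight matrix `weight`, and all shapes are hypothetical and chosen for illustration.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def co_attention(feat_a, feat_b, weight):
    """Illustrative co-attention between two frames (assumed form).

    feat_a, feat_b: arrays of shape (C, N), i.e. C channels over
        N = H * W flattened spatial positions of each frame.
    weight: array of shape (C, C); learnable in a real network,
        supplied by the caller here.
    """
    # Affinity between every position in frame a and every position in b.
    affinity = feat_a.T @ weight @ feat_b          # shape (N_a, N_b)

    # Each position in a attends over all positions in b (rows sum to 1).
    attn_a = softmax(affinity, axis=1)
    # Each position in b attends over all positions in a (columns sum to 1).
    attn_b = softmax(affinity, axis=0)

    # Attention-weighted summaries of the *other* frame.
    summary_for_a = feat_b @ attn_a.T              # shape (C, N_a)
    summary_for_b = feat_a @ attn_b                # shape (C, N_b)

    # Append the co-attention responses to the original features.
    joint_a = np.concatenate([feat_a, summary_for_a], axis=0)
    joint_b = np.concatenate([feat_b, summary_for_b], axis=0)
    return joint_a, joint_b, attn_a, attn_b


# Toy usage: random features stand in for the Siamese encoder outputs.
rng = np.random.default_rng(0)
channels, positions = 4, 6
frame_a = rng.standard_normal((channels, positions))
frame_b = rng.standard_normal((channels, positions))
w = rng.standard_normal((channels, channels))
joint_a, joint_b, attn_a, attn_b = co_attention(frame_a, frame_b, w)
```

In this sketch both frames would pass through the same encoder (the Siamese part) before `co_attention` is called, and a segmentation head would consume `joint_a` and `joint_b`. Processing several reference frames at inference, as the abstract mentions, could be approximated by calling the function once per reference frame and averaging the summaries; that averaging is also an assumption.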

Pages: 3618-3627

Page count: 10
