CASNet: A Cross-Attention Siamese Network for Video Salient Object Detection

Cited by: 93
Authors
Ji, Yuzhu [1 ]
Zhang, Haijun [1 ]
Jie, Zequn [2 ]
Ma, Lin [2 ]
Wu, Q. M. Jonathan [3 ]
Affiliations
[1] Harbin Inst Technol, Shenzhen 518055, Peoples R China
[2] Tencent AI Lab, Shenzhen 518057, Peoples R China
[3] Univ Windsor, Dept Elect & Comp Engn, Windsor, ON N9B 3P4, Canada
Funding
National Natural Science Foundation of China;
Keywords
Object detection; Data models; Saliency detection; Feature extraction; Object oriented modeling; Computational modeling; Optical imaging; Cross attention; inter- and intraframe saliency; salient object; video saliency; SEGMENTATION; OPTIMIZATION; FUSION;
DOI
10.1109/TNNLS.2020.3007534
CLC number
TP18 [theory of artificial intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent work on video salient object detection has demonstrated that directly transferring the generalization ability of image-based models to video data without modeling spatial-temporal information remains nontrivial and challenging. Considering both intraframe accuracy and interframe consistency of saliency detection, this article presents a novel cross-attention-based encoder-decoder model under the Siamese framework (CASNet) for video salient object detection. A baseline encoder-decoder model trained with the Lovász-Softmax loss function is adopted as the backbone network to guarantee the accuracy of intraframe salient object detection. Self- and cross-attention modules are incorporated into the model in order to preserve the saliency correlation between frames and improve interframe saliency detection consistency. Extensive experimental results obtained by ablation analysis and cross-data-set validation demonstrate the effectiveness of the proposed method. Quantitative results indicate that CASNet outperforms 19 state-of-the-art image- and video-based methods on six benchmark data sets.
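The core idea the abstract describes, attending the features of one frame to those of a sibling frame in the Siamese pair, can be illustrated with a minimal dot-product cross-attention sketch. This is not the paper's implementation; the function names, the plain dot-product affinity, and the flattened (positions x channels) feature layout are all illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(feat_a, feat_b):
    """Aggregate frame-B features for each position of frame A.

    feat_a, feat_b: (N, C) arrays -- N flattened spatial positions,
    C channels per position (hypothetical layout, not from the paper).
    """
    affinity = feat_a @ feat_b.T          # (N, N) pairwise similarity
    weights = softmax(affinity, axis=-1)  # normalize over frame-B positions
    return weights @ feat_b               # (N, C) frame-B context for frame A
```

Self-attention is the special case `cross_attention(feat_a, feat_a)`; running both directions (A attends to B, B attends to A) gives each branch of the Siamese pair a view of the other frame, which is what ties the per-frame saliency maps together.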
Pages: 2676-2690 (15 pages)
References
68 in total
[1]  
Bahdanau D, 2016, Arxiv, DOI arXiv:1409.0473
[2]   The Lovasz-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks [J].
Berman, Maxim ;
Triki, Amal Rannen ;
Blaschko, Matthew B. .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :4413-4421
[3]  
Borji Ali, 2019, Computational Visual Media, V5, P117
[4]  
Brox T, 2010, LECT NOTES COMPUT SC, V6315, P282, DOI 10.1007/978-3-642-15555-0_21
[5]   Video Saliency Detection via Spatial-Temporal Fusion and Low-Rank Coherency Diffusion [J].
Chen, Chenglizhao ;
Li, Shuai ;
Wang, Yongguang ;
Qin, Hong ;
Hao, Aimin .
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2017, 26 (07) :3156-3170
[6]  
Chen JZ, 2016, PROCEEDINGS OF 2016 12TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY (CIS), P551, DOI [10.1109/CIS.2016.133, 10.1109/CIS.2016.0134]
[7]   DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs [J].
Chen, Liang-Chieh ;
Papandreou, George ;
Kokkinos, Iasonas ;
Murphy, Kevin ;
Yuille, Alan L. .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2018, 40 (04) :834-848
[8]   SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning [J].
Chen, Long ;
Zhang, Hanwang ;
Xiao, Jun ;
Nie, Liqiang ;
Shao, Jian ;
Liu, Wei ;
Chua, Tat-Seng .
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, :6298-6306
[9]   Predicting Human Eye Fixations via an LSTM-Based Saliency Attentive Model [J].
Cornia, Marcella ;
Baraldi, Lorenzo ;
Serra, Giuseppe ;
Cucchiara, Rita .
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2018, 27 (10) :5142-5154
[10]   Long-Term Recurrent Convolutional Networks for Visual Recognition and Description [J].
Donahue, Jeff ;
Hendricks, Lisa Anne ;
Rohrbach, Marcus ;
Venugopalan, Subhashini ;
Guadarrama, Sergio ;
Saenko, Kate ;
Darrell, Trevor .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2017, 39 (04) :677-691