SimulFlow: Simultaneously Extracting Feature and Identifying Target for Unsupervised Video Object Segmentation

Cited by: 3

Authors:

Hong, Lingyi [1]

Zhang, Wei [1]

Gao, Shuyong [1]

Lu, Hong [1]

Zhang, WenQiang [1,2]

Affiliations:

[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Intelligent Informat Proc, Shanghai, Peoples R China

[2] Fudan Univ, Acad Engn & Technol, Shanghai, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Funding:

National Key Research and Development Program of China; National Natural Science Foundation of China;

Keywords:

unsupervised video object segmentation; optical flow; one-stream structure;

DOI:

10.1145/3581783.3611804

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Unsupervised video object segmentation (UVOS) aims at detecting the primary objects in a given video sequence without any human intervention. Most existing methods rely on two-stream architectures that separately encode the appearance and motion information before fusing them to identify the target and generate object masks. However, this pipeline is computationally expensive and can lead to suboptimal performance due to the difficulty of fusing the two modalities properly. In this paper, we propose a novel UVOS model called SimulFlow that simultaneously performs feature extraction and target identification, enabling efficient and effective unsupervised video object segmentation. Concretely, we design a novel SimulFlow Attention mechanism to bridge the image and motion by utilizing the flexibility of the attention operation, where coarse masks predicted from the fused features at each stage are used to constrain the attention operation within the mask area and exclude the impact of noise. Because of the bidirectional information flow between visual and optical flow features in SimulFlow Attention, no extra hand-designed fusion module is required and we only adopt a light decoder to obtain the final prediction. We evaluate our method on several benchmark datasets and achieve state-of-the-art results. Our proposed approach not only outperforms existing methods but also addresses the computational complexity and fusion difficulties caused by two-stream architectures. Our models achieve 87.4% J&F on DAVIS-16 with the highest speed (63.7 FPS on a 3090) and the fewest parameters (13.7 M). Our SimulFlow also obtains competitive results on video salient object detection datasets.


Pages: 7481-7490

Page count: 10
