UniTR: A Unified TRansformer-Based Framework for Co-Object and Multi-Modal Saliency Detection

Cited by: 15
Authors
Guo, Ruohao [1 ]
Ying, Xianghua [1 ]
Qi, Yanyu [2 ]
Qu, Liao [3 ]
Affiliations
[1] Peking Univ, Sch Intelligence Sci & Technol, Natl Key Lab Gen Artificial Intelligence, Beijing 100871, Peoples R China
[2] China Agr Univ, Coll Informat & Elect Engn, Beijing 100091, Peoples R China
[3] Carnegie Mellon Univ, Elect & Comp Engn Dept, Pittsburgh, PA 15213 USA
Funding
National Natural Science Foundation of China;
Keywords
Object detection; Feature extraction; Task analysis; Transformers; Image segmentation; Semantics; Computer architecture; Co-object segmentation; multi-modal salient object detection; transformer; deep learning; SEGMENTATION; GRAPH; OPTIMIZATION; REFINEMENT; NETWORK; DEEP;
DOI
10.1109/TMM.2024.3369922
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Recent years have witnessed growing interest in co-object segmentation and multi-modal salient object detection. Much effort has been devoted to segmenting objects that co-exist across a group of images or to detecting salient objects from different modalities. Despite the appreciable performance achieved on their respective benchmarks, each of these methods is limited to a specific task and cannot be generalized to the others. In this paper, we develop a Unified TRansformer-based framework, namely UniTR, aiming to tackle each of the above tasks with a single unified architecture. Specifically, a transformer module (CoFormer) is introduced to learn the consistency of relevant objects across images or the complementarity of different modalities. To generate high-quality segmentation maps, we adopt a dual-stream decoding paradigm that allows the extracted consistent or complementary information to better guide mask prediction. Moreover, a feature fusion module (ZoomFormer) is designed to enhance backbone features and capture multi-granularity, multi-semantic information. Extensive experiments show that UniTR performs well on 17 benchmarks and surpasses existing state-of-the-art approaches.
Pages: 7622-7635
Page count: 14
References
129 in total
[1]  
Achanta R, 2009, PROC CVPR IEEE, P1597, DOI 10.1109/CVPRW.2009.5206596
[2]   Audio-visual domain adaptation using conditional semi-supervised Generative Adversarial Networks [J].
Athanasiadis, Christos ;
Hortal, Enrique ;
Asteriadis, Stylianos .
NEUROCOMPUTING, 2020, 397 :331-344
[3]   Adaptive Group-Wise Consistency Network for Co-Saliency Detection [J].
Bai, Zhen ;
Liu, Zhi ;
Li, Gongyang ;
Wang, Yang .
IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 :764-776
[4]   iCoseg: Interactive Co-segmentation with Intelligent Scribble Guidance [J].
Batra, Dhruv ;
Kowdle, Adarsh ;
Parikh, Devi ;
Luo, Jiebo ;
Chen, Tsuhan .
2010 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2010, :3169-3176
[5]   YOLACT Real-time Instance Segmentation [J].
Bolya, Daniel ;
Zhou, Chong ;
Xiao, Fanyi ;
Lee, Yong Jae .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :9156-9165
[6]   End-to-End Object Detection with Transformers [J].
Carion, Nicolas ;
Massa, Francisco ;
Synnaeve, Gabriel ;
Usunier, Nicolas ;
Kirillov, Alexander ;
Zagoruyko, Sergey .
COMPUTER VISION - ECCV 2020, PT I, 2020, 12346 :213-229
[7]   Improved Robust Video Saliency Detection Based on Long-Term Spatial-Temporal Information [J].
Chen, Chenglizhao ;
Wang, Guotao ;
Peng, Chong ;
Zhang, Xiaowei ;
Qin, Hong .
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 :1090-1100
[8]   CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [J].
Chen, Chun-Fu ;
Fan, Quanfu ;
Panda, Rameswar .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :347-356
[9]   Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection [J].
Chen, Hao ;
Li, Youfu ;
Su, Dan .
PATTERN RECOGNITION, 2019, 86 :376-385
[10]   Semantic Aware Attention Based Deep Object Co-segmentation [J].
Chen, Hong ;
Huang, Yifei ;
Nakayama, Hideki .
COMPUTER VISION - ACCV 2018, PT IV, 2019, 11364 :435-450