UniTR: A Unified TRansformer-Based Framework for Co-Object and Multi-Modal Saliency Detection

Cited by: 15
Authors
Guo, Ruohao [1 ]
Ying, Xianghua [1 ]
Qi, Yanyu [2 ]
Qu, Liao [3 ]
Affiliations
[1] Peking Univ, Sch Intelligence Sci & Technol, Natl Key Lab Gen Artificial Intelligence, Beijing 100871, Peoples R China
[2] China Agr Univ, Coll Informat & Elect Engn, Beijing 100091, Peoples R China
[3] Carnegie Mellon Univ, Elect & Comp Engn Dept, Pittsburgh, PA 15213 USA
Funding
National Natural Science Foundation of China;
Keywords
Object detection; Feature extraction; Task analysis; Transformers; Image segmentation; Semantics; Computer architecture; Co-object segmentation; multi-modal salient object detection; transformer; deep learning; SEGMENTATION; GRAPH; OPTIMIZATION; REFINEMENT; NETWORK; DEEP;
DOI
10.1109/TMM.2024.3369922
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Recent years have witnessed growing interest in co-object segmentation and multi-modal salient object detection. Much effort has been devoted to segmenting objects that co-exist across a group of images or to detecting salient objects from different modalities. Despite the appreciable performance achieved on their respective benchmarks, each of these methods is limited to a specific task and cannot be generalized to the others. In this paper, we develop a Unified TRansformer-based framework, namely UniTR, aiming to tackle each of the above tasks with a single unified architecture. Specifically, a transformer module (CoFormer) is introduced to learn the consistency of relevant objects across images or the complementarity of different modalities. To generate high-quality segmentation maps, we adopt a dual-stream decoding paradigm that allows the extracted consistent or complementary information to better guide mask prediction. Moreover, a feature fusion module (ZoomFormer) is designed to enhance backbone features and capture multi-granularity, multi-semantic information. Extensive experiments show that UniTR performs well on 17 benchmarks and surpasses existing state-of-the-art approaches.
Pages: 7622-7635
Page count: 14
References
129 in total
[1]  
Achanta R, 2009, PROC CVPR IEEE, P1597, DOI 10.1109/CVPRW.2009.5206596
[2]   Audio-visual domain adaptation using conditional semi-supervised Generative Adversarial Networks [J].
Athanasiadis, Christos ;
Hortal, Enrique ;
Asteriadis, Stylianos .
NEUROCOMPUTING, 2020, 397 :331-344
[3]   Adaptive Group-Wise Consistency Network for Co-Saliency Detection [J].
Bai, Zhen ;
Liu, Zhi ;
Li, Gongyang ;
Wang, Yang .
IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 :764-776
[4]   iCoseg: Interactive Co-segmentation with Intelligent Scribble Guidance [J].
Batra, Dhruv ;
Kowdle, Adarsh ;
Parikh, Devi ;
Luo, Jiebo ;
Chen, Tsuhan .
2010 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2010, :3169-3176
[5]   YOLACT Real-time Instance Segmentation [J].
Bolya, Daniel ;
Zhou, Chong ;
Xiao, Fanyi ;
Lee, Yong Jae .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :9156-9165
[6]   End-to-End Object Detection with Transformers [J].
Carion, Nicolas ;
Massa, Francisco ;
Synnaeve, Gabriel ;
Usunier, Nicolas ;
Kirillov, Alexander ;
Zagoruyko, Sergey .
COMPUTER VISION - ECCV 2020, PT I, 2020, 12346 :213-229
[7]   Improved Robust Video Saliency Detection Based on Long-Term Spatial-Temporal Information [J].
Chen, Chenglizhao ;
Wang, Guotao ;
Peng, Chong ;
Zhang, Xiaowei ;
Qin, Hong .
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 :1090-1100
[8]   CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [J].
Chen, Chun-Fu ;
Fan, Quanfu ;
Panda, Rameswar .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :347-356
[9]   Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection [J].
Chen, Hao ;
Li, Youfu ;
Su, Dan .
PATTERN RECOGNITION, 2019, 86 :376-385
[10]   Semantic Aware Attention Based Deep Object Co-segmentation [J].
Chen, Hong ;
Huang, Yifei ;
Nakayama, Hideki .
COMPUTER VISION - ACCV 2018, PT IV, 2019, 11364 :435-450