Language-Aware Vision Transformer for Referring Segmentation

Cited by: 1
Authors
Yang, Zhao [1]
Wang, Jiaqi [1]
Ye, Xubing [2 ]
Tang, Yansong [2 ]
Chen, Kai [1 ]
Zhao, Hengshuang [3 ]
Torr, Philip H. S. [4 ]
Affiliations
[1] Shanghai AI Lab, Shanghai 200032, Peoples R China
[2] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen 518055, Peoples R China
[3] Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
[4] Univ Oxford, Dept Engn Sci, Oxford OX1 2JD, England
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
Image segmentation; Transformers; Visualization; Linguistics; Feature extraction; Decoding; Three-dimensional displays; Referring segmentation; language-aware vision transformer; multi-modal understanding;
DOI
10.1109/TPAMI.2024.3468640
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring segmentation is a fundamental vision-language task that aims to segment an object from an image or video according to a natural language description. A key challenge behind this task is leveraging the referring expression to highlight relevant positions in the image or video frames. A common paradigm for tackling this problem in both the image and video domains is to employ a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advances in this paradigm by exploiting Transformers as cross-modal decoders, concurrent with the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation modeling power of a Transformer encoder to excavate helpful multi-modal context. In this way, accurate segmentation results can be harvested with a lightweight mask predictor. One of the key components in the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. For video inputs, we present the video LAVT framework and design a 3D version of this component that introduces multi-scale convolutional operators arranged in parallel, which exploit spatio-temporal dependencies at different granularity levels. We further introduce unified LAVT, a single framework that handles both image and video inputs with enhanced segmentation capability on the unified referring segmentation task. Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS.
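To make the fusion idea in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of a dense pixel-word attention module for the image case. The shapes, layer choices (a 1x1 convolution for pixel queries, linear projections for word keys and values, a Tanh gate), and the class name PixelWordAttention are assumptions for illustration, not the authors' exact implementation; it only shows how each visual position can gather a pixel-specific linguistic cue inside the visual encoding stage and fuse it back through a gated residual.

import torch
import torch.nn as nn

class PixelWordAttention(nn.Module):
    """Sketch: every pixel attends over the word features of the expression."""
    def __init__(self, vis_dim: int, lang_dim: int):
        super().__init__()
        self.query = nn.Conv2d(vis_dim, vis_dim, kernel_size=1)  # pixel queries
        self.key = nn.Linear(lang_dim, vis_dim)                   # word keys
        self.value = nn.Linear(lang_dim, vis_dim)                 # word values
        self.gate = nn.Sequential(nn.Conv2d(vis_dim, vis_dim, 1), nn.Tanh())
        self.scale = vis_dim ** -0.5

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis: (B, C, H, W) visual features; lang: (B, T, Ct) word features
        b, c, h, w = vis.shape
        q = self.query(vis).flatten(2).transpose(1, 2)            # (B, HW, C)
        k = self.key(lang)                                        # (B, T, C)
        v = self.value(lang)                                      # (B, T, C)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, HW, T)
        cue = (attn @ v).transpose(1, 2).reshape(b, c, h, w)      # pixel-specific cue
        return vis + self.gate(cue) * cue                         # gated residual fusion

# Usage sketch: fuse word features into one stage of a hierarchical vision encoder.
vis = torch.randn(2, 256, 30, 30)    # stage output
lang = torch.randn(2, 20, 768)       # e.g. BERT word embeddings
fused = PixelWordAttention(256, 768)(vis, lang)   # (2, 256, 30, 30)

In the video setting described above, the same gathering step would operate on spatio-temporal feature volumes, with parallel multi-scale 3D convolutions replacing the single 1x1 projection; that variant is not sketched here.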
Pages: 5238-5255
Number of pages: 18