Language-Aware Vision Transformer for Referring Segmentation

Cited by: 1
Authors
Yang, Zhao [1]
Wang, Jiaqi [1]
Ye, Xubing [2 ]
Tang, Yansong [2 ]
Chen, Kai [1 ]
Zhao, Hengshuang [3 ]
Torr, Philip H. S. [4 ]
Affiliations
[1] Shanghai AI Lab, Shanghai 200032, Peoples R China
[2] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen 518055, Peoples R China
[3] Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
[4] Univ Oxford, Dept Engn Sci, Oxford OX1 2JD, England
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
Image segmentation; Transformers; Visualization; Linguistics; Feature extraction; Decoding; Three-dimensional displays; Referring segmentation; language-aware vision transformer; multi-modal understanding;
DOI
10.1109/TPAMI.2024.3468640
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring segmentation is a fundamental vision-language task that aims to segment an object from an image or video according to a natural language description. A key challenge behind this task is leveraging the referring expression to highlight relevant positions in the image or video frames. A common paradigm for tackling this problem in both the image and video domains is to employ a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advances in this paradigm by exploiting Transformers as cross-modal decoders, concurrent with the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation modeling power of a Transformer encoder to excavate helpful multi-modal context. In this way, accurate segmentation results can be harvested with a lightweight mask predictor. One of the key components in the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. For video inputs, we present the video LAVT framework and design a 3D version of this component that introduces multi-scale convolutional operators arranged in parallel, which exploit spatio-temporal dependencies at different granularity levels. We further introduce unified LAVT, a single framework that handles both image and video inputs with enhanced segmentation capability on the unified referring segmentation task. Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS.
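To make the fusion idea in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of a dense pixel-word attention module for the image case. The shapes, layer choices (a 1x1 convolution for pixel queries, linear projections for word keys and values, a Tanh gate), and the class name PixelWordAttention are assumptions for illustration, not the authors' exact implementation; it only shows how each visual position can gather a pixel-specific linguistic cue inside the visual encoding stage and fuse it back through a gated residual.

import torch
import torch.nn as nn

class PixelWordAttention(nn.Module):
    """Sketch: every pixel attends over the word features of the expression."""
    def __init__(self, vis_dim: int, lang_dim: int):
        super().__init__()
        self.query = nn.Conv2d(vis_dim, vis_dim, kernel_size=1)  # pixel queries
        self.key = nn.Linear(lang_dim, vis_dim)                   # word keys
        self.value = nn.Linear(lang_dim, vis_dim)                 # word values
        self.gate = nn.Sequential(nn.Conv2d(vis_dim, vis_dim, 1), nn.Tanh())
        self.scale = vis_dim ** -0.5

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis: (B, C, H, W) visual features; lang: (B, T, Ct) word features
        b, c, h, w = vis.shape
        q = self.query(vis).flatten(2).transpose(1, 2)            # (B, HW, C)
        k = self.key(lang)                                        # (B, T, C)
        v = self.value(lang)                                      # (B, T, C)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, HW, T)
        cue = (attn @ v).transpose(1, 2).reshape(b, c, h, w)      # pixel-specific cue
        return vis + self.gate(cue) * cue                         # gated residual fusion

# Usage sketch: fuse word features into one stage of a hierarchical vision encoder.
vis = torch.randn(2, 256, 30, 30)    # stage output
lang = torch.randn(2, 20, 768)       # e.g. BERT word embeddings
fused = PixelWordAttention(256, 768)(vis, lang)   # (2, 256, 30, 30)

In the video setting described above, the same gathering step would operate on spatio-temporal feature volumes, with parallel multi-scale 3D convolutions replacing the single 1x1 projection; that variant is not sketched here.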
Pages: 5238-5255
Number of pages: 18