Language-Aware Vision Transformer for Referring Segmentation

Cited: 1
Authors
Yang, Zhao [1 ]
Wang, Jiaqi [1 ]
Ye, Xubing [2 ]
Tang, Yansong [2 ]
Chen, Kai [1 ]
Zhao, Hengshuang [3 ]
Torr, Philip H. S. [4 ]
Affiliations
[1] Shanghai AI Lab, Shanghai 200032, Peoples R China
[2] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen 518055, Peoples R China
[3] Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
[4] Univ Oxford, Dept Engn Sci, Oxford OX1 2JD, England
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
Image segmentation; Transformers; Visualization; Linguistics; Feature extraction; Decoding; Three-dimensional displays; Referring segmentation; language-aware vision transformer; multi-modal understanding;
DOI
10.1109/TPAMI.2024.3468640
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Referring segmentation is a fundamental vision-language task that aims to segment an object from an image or video according to a natural language description. One of the key challenges of this task is leveraging the referring expression to highlight relevant positions in the image or video frames. A common paradigm for tackling this problem in both the image and video domains is to use a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advances within this paradigm by exploiting Transformers as cross-modal decoders, in parallel with the Transformer's overwhelming success in many other vision-language tasks. Taking a different approach in this work, we show that significantly better cross-modal alignments can be achieved through early fusion of linguistic and visual features in the intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation-modeling power of a Transformer encoder to extract helpful multi-modal context. In this way, accurate segmentation results can be obtained with a lightweight mask predictor. One of the key components of the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. For video inputs, we present the video LAVT framework and design a 3D version of this component by introducing multi-scale convolutional operators arranged in parallel, which exploit spatio-temporal dependencies at different granularity levels. We further introduce unified LAVT, a unified framework that handles both image and video inputs with enhanced segmentation capability on the unified referring segmentation task.
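The two ideas the abstract highlights — dense attention that gathers pixel-specific linguistic cues, and gated early fusion of those cues into the visual features inside the encoder — can be illustrated with a minimal NumPy sketch. This is an illustrative simplification under assumed shapes, not the paper's implementation; the function names `pixel_word_attention` and `language_aware_fusion`, the shared feature dimension, and the sigmoid gating are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pixel_word_attention(visual, words):
    """Dense attention: each pixel attends over all word features
    to collect its own pixel-specific linguistic cue vector.

    visual: (HW, C) flattened visual feature map
    words:  (T, C)  word features from a language encoder
    returns (HW, C) linguistic cues, one per pixel
    """
    d_k = visual.shape[-1]
    scores = visual @ words.T / np.sqrt(d_k)   # (HW, T) pixel-to-word affinities
    attn = softmax(scores, axis=-1)            # rows sum to 1 over words
    return attn @ words                        # (HW, C)

def language_aware_fusion(visual, words):
    """Gated early fusion: inject linguistic cues into visual features
    within the encoding stage, instead of fusing in a late decoder."""
    cues = pixel_word_attention(visual, words)
    gate = 1.0 / (1.0 + np.exp(-cues))         # elementwise sigmoid gate
    return visual + gate * cues                # residual language injection
```

Applying this fusion between successive encoder stages (rather than once after encoding) is what makes the alignment "early": subsequent visual layers already operate on language-conditioned features.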
Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS.
Pages: 5238-5255
Page count: 18