Language-Aware Vision Transformer for Referring Segmentation

Cited: 1
Authors
Yang, Zhao [1 ]
Wang, Jiaqi [1 ]
Ye, Xubing [2 ]
Tang, Yansong [2 ]
Chen, Kai [1 ]
Zhao, Hengshuang [3 ]
Torr, Philip H. S. [4 ]
Affiliations
[1] Shanghai AI Lab, Shanghai 200032, Peoples R China
[2] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen 518055, Peoples R China
[3] Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
[4] Univ Oxford, Dept Engn Sci, Oxford OX1 2JD, England
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
Image segmentation; Transformers; Visualization; Linguistics; Feature extraction; Decoding; Three-dimensional displays; Referring segmentation; language-aware vision transformer; multi-modal understanding;
DOI
10.1109/TPAMI.2024.3468640
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Referring segmentation is a fundamental vision-language task that aims to segment an object from an image or video according to a natural language description. One of the key challenges of this task is leveraging the referring expression to highlight relevant positions in the image or video frames. A common paradigm for tackling this problem in both the image and video domains is to use a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advances within this paradigm by exploiting Transformers as cross-modal decoders, in parallel with the Transformer's overwhelming success in many other vision-language tasks. Taking a different approach in this work, we show that significantly better cross-modal alignments can be achieved through early fusion of linguistic and visual features in the intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation-modeling power of a Transformer encoder to extract helpful multi-modal context. In this way, accurate segmentation results can be obtained with a lightweight mask predictor. One of the key components of the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. For video inputs, we present the video LAVT framework and design a 3D version of this component by introducing multi-scale convolutional operators arranged in parallel, which exploit spatio-temporal dependencies at different granularity levels. We further introduce unified LAVT, a unified framework that handles both image and video inputs with enhanced segmentation capability on the unified referring segmentation task.
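The two ideas the abstract highlights — dense attention that gathers pixel-specific linguistic cues, and gated early fusion of those cues into the visual features inside the encoder — can be illustrated with a minimal NumPy sketch. This is an illustrative simplification under assumed shapes, not the paper's implementation; the function names `pixel_word_attention` and `language_aware_fusion`, the shared feature dimension, and the sigmoid gating are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pixel_word_attention(visual, words):
    """Dense attention: each pixel attends over all word features
    to collect its own pixel-specific linguistic cue vector.

    visual: (HW, C) flattened visual feature map
    words:  (T, C)  word features from a language encoder
    returns (HW, C) linguistic cues, one per pixel
    """
    d_k = visual.shape[-1]
    scores = visual @ words.T / np.sqrt(d_k)   # (HW, T) pixel-to-word affinities
    attn = softmax(scores, axis=-1)            # rows sum to 1 over words
    return attn @ words                        # (HW, C)

def language_aware_fusion(visual, words):
    """Gated early fusion: inject linguistic cues into visual features
    within the encoding stage, instead of fusing in a late decoder."""
    cues = pixel_word_attention(visual, words)
    gate = 1.0 / (1.0 + np.exp(-cues))         # elementwise sigmoid gate
    return visual + gate * cues                # residual language injection
```

Applying this fusion between successive encoder stages (rather than once after encoding) is what makes the alignment "early": subsequent visual layers already operate on language-conditioned features.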
Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS.
Pages: 5238-5255
Page count: 18