Text-Vision Relationship Alignment for Referring Image Segmentation

Cited by: 1

Authors:

Pu, Mingxing [1]

Luo, Bing [1]

Zhang, Chao [2]

Xu, Li [3]

Xu, Fayou [1]

Kong, Mingming [1]

Affiliations:

[1] Xihua Univ, Sch Comp & Software Engn, Chengdu 610039, Peoples R China

[2] Sichuan Police Coll, Key Lab Intelligent Policing, Luzhou 646000, Peoples R China

[3] Xihua Univ, Sch Sci, Chengdu 610039, Peoples R China

Source:

NEURAL PROCESSING LETTERS | 2024 / Vol. 56 / Issue 02

Funding:

National Natural Science Foundation of China;

Keywords:

Semantic parsing; Text-vision alignment; Referring image segmentation;

DOI:

10.1007/s11063-024-11487-2

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Referring image segmentation aims to segment an object in an image based on a referring expression. Its difficulty lies in aligning expression semantics with visual instances. Existing methods based on semantic reasoning are limited by the performance of the external syntax parser and do not explicitly explore the relationships between visual instances. This article proposes an end-to-end method for referring image segmentation that aligns 'linguistic relationships' with 'visual relationships'. The method does not rely on an external syntax parser for expression parsing. Instead, the Semantic Component Parser (SCP) adaptively and structurally parses the expression into three components, 'subject', 'object', and 'linguistic relationship', in a learnable manner. The Instances Activation Map Module (IAM) locates multiple visual instances based on the subject and object. In addition, the Relationship Based Visual Localization Module (RBVL) first enables each instance of the image to learn global knowledge, then decodes the visual relationships between these visual instances, and finally aligns the visual relationships with the linguistic relationships to locate the target object more accurately. The experimental results show that the proposed method improves performance by 4-9% compared with the baseline method on multiple referring image segmentation datasets.
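The abstract gives no implementation details, so the following is only a rough sketch of the two ideas it names: soft parsing of an expression into three components, and alignment of a linguistic relationship with pairwise visual relationships. Everything in it is an assumption made for illustration: the attention-pooling form of the parser, the cosine-similarity alignment, the function names `parse_components` and `align_relationship`, and the random features. It is not the authors' SCP, IAM, or RBVL.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def parse_components(word_feats, queries):
    """Soft-parse an expression into component vectors (hypothetical design).

    word_feats: (T, D) word embeddings of the expression.
    queries:    (3, D) assumed learnable queries for subject, object,
                and linguistic relationship.
    Returns (3, D) component vectors and (3, T) attention weights.
    """
    scores = queries @ word_feats.T / np.sqrt(word_feats.shape[1])
    attn = softmax(scores, axis=-1)      # each component attends over words
    return attn @ word_feats, attn

def align_relationship(rel_vec, pair_feats):
    """Cosine similarity between the linguistic relationship and each
    ordered pair of visual instances (assumed alignment rule).

    rel_vec:    (D,) linguistic relationship vector.
    pair_feats: (N, N, D) assumed visual relationship features.
    Returns an (N, N) score matrix.
    """
    rel = rel_vec / (np.linalg.norm(rel_vec) + 1e-8)
    norms = np.linalg.norm(pair_feats, axis=-1, keepdims=True) + 1e-8
    return (pair_feats / norms) @ rel

# Random stand-ins for real text and visual features.
rng = np.random.default_rng(0)
T, D, N = 6, 16, 4
words = rng.normal(size=(T, D))
queries = rng.normal(size=(3, D))
components, attn = parse_components(words, queries)
subject_vec, object_vec, relation_vec = components

pair_feats = rng.normal(size=(N, N, D))
scores = align_relationship(relation_vec, pair_feats)
best_pair = np.unravel_index(np.argmax(scores), scores.shape)
```

In this sketch, `best_pair` is the pair of instance indices whose assumed relationship feature best matches the parsed relationship; a real system would learn the queries and relationship features end to end rather than sample them.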


Pages: 21

