Dual-graph hierarchical interaction network for referring image segmentation

Cited by: 2
Authors
Shi, Zhaofeng [1 ]
Wu, Qingbo [1 ]
Li, Hongliang [1 ]
Meng, Fanman [1 ]
Ngan, King Ngi [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Informat & Commun Engn, Chengdu 611731, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Referring image segmentation; Graph reasoning; Hierarchical interaction; Blind quality assessment; Movement; Head
DOI
10.1016/j.displa.2023.102575
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline code
0812
Abstract
Referring Image Segmentation (RIS) aims to extract the object or stuff from an image according to a given natural language expression. As a representative multi-modal reasoning task, the main challenge of RIS lies in accurately understanding and aligning two types of heterogeneous data (i.e., image and text). Existing methods typically conduct this task via implicit cross-modal fusion of visual and linguistic features that are separately extracted by different encoders; owing to their distinct latent representation structures, such features struggle to capture accurate image-text alignment information. In this paper, we propose a Dual-Graph Hierarchical Interaction Network (DGHIN) to facilitate explicit and comprehensive alignment between image and text data. First, two graphs are built separately for the initial visual and linguistic features extracted with different pre-trained encoders. By means of graph reasoning, we obtain a unified representation structure for the two modalities that captures intra-modal entities and their contexts, where each projected node incorporates long-range dependencies into the latent representation. Then, a Hierarchical Interaction Module (HIM) is applied to the visual and linguistic graphs to extract comprehensive inter-modal correlations at the entity level and the graph level, which not only matches corresponding keywords and visual patches but also draws the whole sentence closer, in the latent space, to the image region with a consistent context. Extensive experiments on RefCOCO, RefCOCO+, G-Ref, and ReferIt demonstrate that the proposed DGHIN outperforms many state-of-the-art methods. Code is available at https://github.com/ZhaofengSHI/referring-DGHIN.
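The graph-reasoning step described in the abstract (projecting per-position features onto a small set of graph nodes, propagating context among the nodes, and re-projecting back) can be illustrated with a minimal sketch. This is not the authors' implementation: the soft-assignment projection, the fully connected node graph, and all shapes and random weights below are illustrative assumptions only.

```python
import numpy as np

def graph_reason(features, num_nodes=4, seed=0):
    """Minimal sketch of graph reasoning over a set of feature vectors.

    features : (L, C) array of L feature positions (visual patches or
               word embeddings), each of dimension C.
    Returns an (L, C) array where each position has been enriched with
    node-level context via one round of message passing.
    """
    rng = np.random.default_rng(seed)
    L, C = features.shape
    # Soft-assign each feature position to the graph nodes
    # (softmax over scores from an illustrative random projection).
    scores = features @ rng.standard_normal((C, num_nodes))
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    assign = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    nodes = assign.T @ features                        # (num_nodes, C)
    # Fully connected node graph: a normalized adjacency lets every node
    # aggregate context from all others, capturing long-range dependencies.
    adj = np.full((num_nodes, num_nodes), 1.0 / num_nodes)
    nodes = adj @ nodes
    # Re-project the context-aware node states back to feature positions.
    return assign @ nodes                              # (L, C)

feat = np.random.default_rng(1).standard_normal((16, 8))
out = graph_reason(feat)
print(out.shape)  # (16, 8)
```

In the paper this projection/propagation would be performed with learned weights on both the visual and the linguistic branch, yielding the unified node-based representation that the Hierarchical Interaction Module then aligns across modalities.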
Pages: 12