InterREC: An Interpretable Method for Referring Expression Comprehension

Cited by: 0
Authors
Wang, Wenbin [1 ]
Pagnucco, Maurice [1 ]
Xu, Chengpei [2 ]
Song, Yang [1 ]
Affiliations
[1] Univ New South Wales, Sch Comp Sci & Engn, Kensington, NSW 2033, Australia
[2] Univ New South Wales, Sch Minerals & Energy Resources Engn, Kensington, NSW 2033, Australia
Keywords
Cognition; Visualization; Task analysis; Transformers; Feature extraction; Linguistics; Representation learning; Referring expression comprehension; transfer learning; Bayesian network; reasoning
DOI
10.1109/TMM.2023.3251111
CLC Number
TP [Automation technology, computer technology]
Discipline Code
0812
Abstract
Referring Expression Comprehension (REC) aims to locate the target object in an image according to a referring expression. This is a challenging task because it requires understanding both natural language and visual information, as well as interpretable reasoning between the two modalities. Most existing implicit reasoning-based REC methods lack interpretability, while explicit reasoning-based REC methods suffer from lower accuracy. To achieve competitive accuracy while providing adequate interpretability, we propose a novel explicit reasoning-based method named InterREC. First, to address the challenge of multi-modal understanding, we design two neural network modules based on text-image representation learning: a Text-Region Matching Module that aligns objects in the image with noun phrases in the expression, and a Text-Relation Matching Module that aligns relations between objects in the image with relational phrases in the expression. Additionally, we design a Reasoning Order Tree for handling complex expressions, which decomposes a complex expression into multiple object-relation-object triplets, thereby identifying the inference order and reducing the difficulty of reasoning. To make each reasoning step interpretable, we design a Bayesian network-based explicit reasoning method. In comparative evaluations on multiple datasets, our method achieves higher accuracy than existing explicit reasoning-based REC methods, and visualization results demonstrate its high interpretability.
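To make the pipeline described in the abstract concrete, the following is a minimal, hypothetical Python sketch of how the explicit reasoning step could be organized. It is not the paper's implementation: the real Text-Region and Text-Relation Matching Modules are learned neural networks, and the exact Bayesian network is defined in the paper; the triplet structure, function names, and toy scores below are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Triplet:
    """One object-relation-object step from the Reasoning Order Tree."""
    subject: str   # noun phrase to resolve, e.g. "cup"
    relation: str  # relational phrase, e.g. "on"
    obj: str       # supporting noun phrase, e.g. "table"


def phrase_belief(
    phrase: str,
    regions: List[str],
    region_score: Dict[Tuple[str, str], float],         # stand-in for Text-Region Matching
    relation_score: Dict[Tuple[str, str, str], float],  # stand-in for Text-Relation Matching
    triplets: Dict[str, Triplet],                       # subject phrase -> its triplet
) -> Dict[str, float]:
    """Belief P(region | phrase): unary matching evidence, refined by a
    relational message marginalized over the related phrase's own belief,
    in the spirit of Bayesian-network inference."""
    # Unary evidence (a small floor avoids zero probabilities).
    scores = {r: region_score.get((phrase, r), 1e-6) for r in regions}
    # Pairwise evidence: recurse down the triplet chain, then marginalize.
    trip = triplets.get(phrase)
    if trip is not None:
        obj_belief = phrase_belief(trip.obj, regions, region_score,
                                   relation_score, triplets)
        for r in regions:
            scores[r] *= sum(
                relation_score.get((trip.relation, r, r2), 1e-6) * p
                for r2, p in obj_belief.items()
            )
    # Normalize so each intermediate belief reads as a distribution.
    z = sum(scores.values())
    return {r: s / z for r, s in scores.items()}


if __name__ == "__main__":
    # Toy scene: resolve "the cup on the table" over three candidate regions.
    regions = ["r1", "r2", "r3"]
    region_score = {
        ("cup", "r1"): 0.7, ("cup", "r2"): 0.7,  # two equally cup-like regions
        ("table", "r3"): 0.9,
    }
    relation_score = {("on", "r1", "r3"): 0.9}   # only r1 sits on r3
    triplets = {"cup": Triplet("cup", "on", "table")}

    belief = phrase_belief("cup", regions, region_score, relation_score, triplets)
    print(belief, "->", max(belief, key=belief.get))  # relation breaks the tie: r1
```

In the toy demo, the unary (text-region) scores alone cannot separate the two cup-like regions; the relational (text-relation) evidence does, and every intermediate belief can be inspected, which is the kind of step-by-step interpretability the abstract describes.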
Pages: 9330-9342
Page count: 13