InterREC: An Interpretable Method for Referring Expression Comprehension

Cited by: 0
Authors
Wang, Wenbin [1 ]
Pagnucco, Maurice [1 ]
Xu, Chengpei [2 ]
Song, Yang [1 ]
Affiliations
[1] Univ New South Wales, Sch Comp Sci & Engn, Kensington, NSW 2033, Australia
[2] Univ New South Wales, Sch Minerals & Energy Resources Engn, Kensington, NSW 2033, Australia
Keywords
Cognition; Visualization; Task analysis; Transformers; Feature extraction; Linguistics; Representation learning; Referring expression comprehension; transfer learning; Bayesian network; reasoning
DOI
10.1109/TMM.2023.3251111
CLC Number
TP [Automation technology, computer technology]
Discipline Code
0812
Abstract
Referring Expression Comprehension (REC) aims to locate the target object in an image according to a referring expression. This is a challenging task because it requires understanding both natural language and visual information, as well as interpretable reasoning between the two modalities. Most existing implicit reasoning-based REC methods lack interpretability, while explicit reasoning-based REC methods suffer from lower accuracy. To achieve competitive accuracy while providing adequate interpretability, we propose a novel explicit reasoning-based method named InterREC. First, to address the challenge of multi-modal understanding, we design two neural network modules based on text-image representation learning: a Text-Region Matching Module that aligns objects in the image with noun phrases in the expression, and a Text-Relation Matching Module that aligns relations between objects in the image with relational phrases in the expression. Additionally, we design a Reasoning Order Tree for handling complex expressions, which decomposes a complex expression into multiple object-relation-object triplets, thereby identifying the inference order and reducing the difficulty of reasoning. To make each reasoning step interpretable, we design a Bayesian network-based explicit reasoning method. In comparative evaluations on multiple datasets, our method achieves higher accuracy than existing explicit reasoning-based REC methods, and visualization results demonstrate its high interpretability.
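To make the pipeline described in the abstract concrete, the following is a minimal, hypothetical Python sketch of how the explicit reasoning step could be organized. It is not the paper's implementation: the real Text-Region and Text-Relation Matching Modules are learned neural networks, and the exact Bayesian network is defined in the paper; the triplet structure, function names, and toy scores below are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Triplet:
    """One object-relation-object step from the Reasoning Order Tree."""
    subject: str   # noun phrase to resolve, e.g. "cup"
    relation: str  # relational phrase, e.g. "on"
    obj: str       # supporting noun phrase, e.g. "table"


def phrase_belief(
    phrase: str,
    regions: List[str],
    region_score: Dict[Tuple[str, str], float],         # stand-in for Text-Region Matching
    relation_score: Dict[Tuple[str, str, str], float],  # stand-in for Text-Relation Matching
    triplets: Dict[str, Triplet],                       # subject phrase -> its triplet
) -> Dict[str, float]:
    """Belief P(region | phrase): unary matching evidence, refined by a
    relational message marginalized over the related phrase's own belief,
    in the spirit of Bayesian-network inference."""
    # Unary evidence (a small floor avoids zero probabilities).
    scores = {r: region_score.get((phrase, r), 1e-6) for r in regions}
    # Pairwise evidence: recurse down the triplet chain, then marginalize.
    trip = triplets.get(phrase)
    if trip is not None:
        obj_belief = phrase_belief(trip.obj, regions, region_score,
                                   relation_score, triplets)
        for r in regions:
            scores[r] *= sum(
                relation_score.get((trip.relation, r, r2), 1e-6) * p
                for r2, p in obj_belief.items()
            )
    # Normalize so each intermediate belief reads as a distribution.
    z = sum(scores.values())
    return {r: s / z for r, s in scores.items()}


if __name__ == "__main__":
    # Toy scene: resolve "the cup on the table" over three candidate regions.
    regions = ["r1", "r2", "r3"]
    region_score = {
        ("cup", "r1"): 0.7, ("cup", "r2"): 0.7,  # two equally cup-like regions
        ("table", "r3"): 0.9,
    }
    relation_score = {("on", "r1", "r3"): 0.9}   # only r1 sits on r3
    triplets = {"cup": Triplet("cup", "on", "table")}

    belief = phrase_belief("cup", regions, region_score, relation_score, triplets)
    print(belief, "->", max(belief, key=belief.get))  # relation breaks the tie: r1
```

In the toy demo, the unary (text-region) scores alone cannot separate the two cup-like regions; the relational (text-relation) evidence does, and every intermediate belief can be inspected, which is the kind of step-by-step interpretability the abstract describes.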
Pages: 9330-9342
Page count: 13