Referring expression comprehension model with matching detection and linguistic feedback

Cited by: 0
Authors
Wang, Jianming [1, 2]
Cui, Enjie [3 ]
Liu, Kunliang [1 ]
Sun, Yukuan [3 ]
Liang, Jiayu [1 ]
Yuan, Chunmiao [1 ]
Duan, Xiaojie [3 ]
Jin, Guanghao [1, 4]
Chung, Tae-Sun [5 ]
Affiliations
[1] Tiangong Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China
[2] Tiangong Univ, Tianjin Key Lab Autonomous Intelligence Technol &, Tianjin, Peoples R China
[3] Tiangong Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China
[4] Tiangong Univ, Tianjin Int Joint Res & Dev Ctr Autonomous Intell, Tianjin, Peoples R China
[5] Ajou Univ, Dept Comp Engn, Suwon 16499, South Korea
Funding
National Research Foundation of Singapore; National Natural Science Foundation of China
Keywords
SEGMENTATION; RECOGNITION; FEATURES; TEXTURE;
DOI
10.1049/iet-cvi.2019.0483
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The task of referring expression comprehension (REC) is to localise an image region of a specific object described by a natural language expression, and all existing REC methods assume that the object described by the referring expression must be located in the given image. However, this assumption is not correct in some real applications. For example, a visually impaired user might tell his robot 'please take the laptop on the table to me'. In fact, the laptop is not on the table anymore. To address this problem, the authors propose a novel REC model to deal with the situation where expression-image mismatching occurs and explain the mismatching by linguistic feedback. The authors' REC model consists of four modules: the expression parsing module, the entity detection module, the relationship detection module, and the matching detection module. They built a data set called NP-RefCOCO+ from RefCOCO+ including both positive samples and negative samples. The positive samples are original expression-image pairs in RefCOCO+. The negative samples are the expression-image pairs in RefCOCO+, whose expressions are replaced. They evaluate the model on NP-RefCOCO+ and the experimental results show the advantages of their method for dealing with the problem of expression-image mismatching.
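The data set construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract says only that negative samples are RefCOCO+ pairs "whose expressions are replaced", so the replacement rule used here (borrowing an expression from a different image) is an assumption, and the names `Sample` and `build_np_pairs` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    image_id: str
    expression: str
    label: int  # 1 = expression matches the image, 0 = mismatch


def build_np_pairs(pairs, seed=0):
    """Build positive and negative samples from (image_id, expression) pairs.

    Positives are the original pairs. Each negative keeps the image and
    swaps in an expression taken from a different image (an assumed rule).
    """
    rng = random.Random(seed)
    samples = []
    for image_id, expression in pairs:
        samples.append(Sample(image_id, expression, 1))
        # Candidate replacements: expressions attached to other images
        # whose text differs from the original expression.
        candidates = [e for i, e in pairs if i != image_id and e != expression]
        if candidates:
            samples.append(Sample(image_id, rng.choice(candidates), 0))
    return samples


if __name__ == "__main__":
    toy_pairs = [
        ("img1", "the laptop on the table"),
        ("img2", "man in red shirt"),
        ("img3", "left giraffe"),
    ]
    for sample in build_np_pairs(toy_pairs):
        print(sample)
```

A borrowed expression can still happen to describe something in the new image, so a real pipeline would presumably need an additional check before labelling such a pair as a mismatch.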
Pages: 625-633
Page count: 9