Entity-Enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding

Times cited: 23
Authors
Liu, Xuejing [1]
Li, Liang [1]
Wang, Shuhui [1]
Zha, Zheng-Jun [2]
Li, Zechao [3]
Tian, Qi [4]
Huang, Qingming [5, 6, 7]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, CAS, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[2] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230027, Anhui, Peoples R China
[3] Nanjing Univ Sci & Technol, Sch Comp Sci, Nanjing 210094, Jiangsu, Peoples R China
[4] Huawei Cloud & AI, Shenzhen 518129, Guangdong, Peoples R China
[5] Univ Chinese Acad Sci, Sch Comp & Control Engn, Beijing 100190, Peoples R China
[6] Chinese Acad Sci, Inst Comp Technol, Beijing 100190, Peoples R China
[7] Peng Cheng Lab, Shenzhen 518066, Guangdong, Peoples R China
Funding
National Key R&D Program of China
Keywords
Entity enhancement; adaptive reconstruction; referring expression grounding
DOI
10.1109/TPAMI.2022.3186410
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Weakly supervised Referring Expression Grounding (REG) aims to ground, in an image, the target described by a language expression, without annotations of the correspondence between the target and the expression. Weakly supervised REG faces two main problems. First, the lack of region-level annotations introduces ambiguities between proposals and queries. Second, most previous weakly supervised REG methods ignore the discriminative location and context of the referent, which makes it difficult to distinguish the target from other objects of the same category. To address these challenges, we design an entity-enhanced adaptive reconstruction network (EARN). Specifically, EARN consists of three modules: entity enhancement, adaptive grounding, and collaborative reconstruction. In entity enhancement, we compute semantic similarity as supervision to select candidate proposals. Adaptive grounding computes the ranking score of each candidate proposal from its subject, location, and context with hierarchical attention. Collaborative reconstruction assesses the ranking result from three perspectives: adaptive reconstruction, language reconstruction, and attribute classification. The adaptive mechanism helps alleviate the variance across different referring expressions. Experiments on five datasets show that EARN outperforms existing state-of-the-art methods. Qualitative results demonstrate that EARN better handles scenes in which multiple objects of the same category appear together.
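To make the candidate-selection and ranking steps in the abstract more concrete, here is a minimal, hypothetical sketch of the first two modules, assuming precomputed embeddings and per-module matching scores. Everything in it is an illustrative assumption rather than the authors' implementation: the function names `select_candidates` and `adaptive_ranking`, the use of cosine similarity between the query's subject embedding and each proposal's detected-class embedding, and the flat softmax over module logits, which stands in for EARN's hierarchical attention. The collaborative reconstruction losses are omitted.

```python
import math
from typing import Dict, List, Sequence

MODULES = ("subject", "location", "context")


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_candidates(subject_emb: Sequence[float],
                      proposal_class_embs: Sequence[Sequence[float]],
                      k: int) -> List[int]:
    """Entity-enhancement sketch (assumption): rank proposals by the semantic
    similarity between the query's subject embedding and each proposal's
    detected-class embedding, and return the indices of the top-k."""
    sims = [cosine(subject_emb, e) for e in proposal_class_embs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order[:k]


def adaptive_ranking(module_logits: Dict[str, float],
                     module_scores: Dict[str, Sequence[float]],
                     candidates: List[int]) -> Dict[int, float]:
    """Adaptive-grounding sketch (assumption): a flat softmax over
    expression-dependent module logits stands in for hierarchical attention.
    The weighted subject/location/context scores are normalized over the
    candidates only, giving a ranking distribution."""
    weights = dict(zip(MODULES, softmax([module_logits[m] for m in MODULES])))
    combined = [sum(weights[m] * module_scores[m][i] for m in MODULES)
                for i in candidates]
    return dict(zip(candidates, softmax(combined)))


# Toy data (hypothetical): four proposals with 2-d class embeddings.
# Proposals 0, 1 and 3 share the query's category; proposal 2 does not.
subject = [1.0, 0.0]
class_embs = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.8, 0.2]]
candidates = select_candidates(subject, class_embs, k=3)

# Hypothetical logits that emphasize location, as an expression such as
# "the man on the left" might; scores are per proposal, indexed 0..3.
logits = {"subject": 0.0, "location": 2.0, "context": 0.0}
scores = {
    "subject": [0.9, 0.9, 0.1, 0.9],
    "location": [0.9, 0.1, 0.5, 0.2],
    "context": [0.5, 0.5, 0.5, 0.5],
}
ranking = adaptive_ranking(logits, scores, candidates)
best = max(ranking, key=ranking.get)
print(candidates, best)
```

In the toy data, entity enhancement discards proposal 2 because its class embedding is orthogonal to the query subject. Among the remaining same-category candidates the subject scores are tied, so the location-weighted module scores decide, and proposal 0 (the best location match) ranks first. This mirrors, in simplified form, the abstract's point that location and context help separate the referent from other objects of the same category.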
Pages: 3003–3018
Number of pages: 16