Referring Expression Comprehension Using Language Adaptive Inference

Cited by: 0
Authors
Su, Wei [1]
Miao, Peihan [2]
Dou, Huanzhang [1]
Fu, Yongjian [1]
Li, Xi [1, 3, 4]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci Technol, Hangzhou, Peoples R China
[2] Zhejiang Univ, Sch Software Technol, Hangzhou, Peoples R China
[3] Zhejiang Univ, Shanghai Inst Adv Study, Hangzhou, Peoples R China
[4] Shanghai AI Lab, Shanghai, Peoples R China
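Source
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 2 | 2023
Funding
National Science Foundation (United States); National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract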
Unlike universal object detection, referring expression comprehension (REC) aims to locate the specific objects referred to by natural language expressions. An expression provides high-level concepts of the relevant visual and contextual patterns; these patterns vary significantly across expressions and account for only a few of those encoded in the REC model. This leads us to a question: do we really need the entire network, with a fixed structure, for every referring expression? Ideally, given an expression, only the expression-relevant components of the REC model are required. These components should be small in number, as each expression contains only a few visual and contextual clues. This paper explores the adaptation between expressions and REC models for dynamic inference. Concretely, we propose a neat yet efficient framework named Language Adaptive Dynamic Subnets (LADS), which can extract language-adaptive subnets from the REC model conditioned on the referring expressions. By using the compact subnet, inference can be more economical and efficient. Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and Referit show that the proposed method achieves faster inference and higher accuracy than state-of-the-art approaches.
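The idea of extracting an expression-conditioned subnet can be illustrated with a toy sketch. This is not the LADS implementation, whose details the abstract does not give; it only assumes a simple top-k channel gate. All names (`select_subnet`, `subnet_forward`, `gate_weights`, `layer_weights`) and all numeric values are hypothetical stand-ins.

```python
import random


def select_subnet(expr_embedding, gate_weights, keep):
    """Score each channel from the expression embedding; return the top `keep` channel ids.

    Assumption: a plain linear gate with top-k selection, chosen only for illustration.
    """
    scores = [sum(w * e for w, e in zip(row, expr_embedding)) for row in gate_weights]
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:keep])


def subnet_forward(features, layer_weights, active):
    """Linear layer followed by ReLU, evaluated only for the active output channels."""
    out = {}
    for c in active:
        pre = sum(w * x for w, x in zip(layer_weights[c], features))
        out[c] = max(0.0, pre)
    return out


random.seed(0)
EMBED_DIM, IN_DIM, CHANNELS, KEEP = 4, 3, 8, 2

# Hypothetical random parameters standing in for a trained gate and a trained layer.
gate_weights = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(CHANNELS)]
layer_weights = [[random.uniform(-1, 1) for _ in range(IN_DIM)] for _ in range(CHANNELS)]

expr_a = [1.0, 0.0, 0.0, 0.0]  # stand-in embedding for one expression
expr_b = [0.0, 0.0, 0.0, 1.0]  # stand-in embedding for a different expression
visual = [0.5, -0.2, 0.9]      # stand-in visual feature vector

active_a = select_subnet(expr_a, gate_weights, KEEP)
active_b = select_subnet(expr_b, gate_weights, KEEP)
out_a = subnet_forward(visual, layer_weights, active_a)
out_b = subnet_forward(visual, layer_weights, active_b)
```

In this sketch only `KEEP` of the `CHANNELS` output channels are computed per expression, which is where the saving in inference cost would come from; which channels are kept depends on the expression embedding.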
Pages: 2357-2365
Number of pages: 9