Referring Expression Comprehension Using Language Adaptive Inference

Cited by: 0

Authors:

Su, Wei ^{[1]}

Miao, Peihan ^{[2]}

Dou, Huanzhang ^{[1]}

Fu, Yongjian ^{[1]}

Li, Xi ^{[1,3,4]}

Affiliations:

[1] Zhejiang Univ, Coll Comp Sci Technol, Hangzhou, Peoples R China

[2] Zhejiang Univ, Sch Software Technol, Hangzhou, Peoples R China

[3] Zhejiang Univ, Shanghai Inst Adv Study, Hangzhou, Peoples R China

[4] Shanghai AI Lab, Shanghai, Peoples R China

Source:

THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 2 | 2023

Funding:

U.S. National Science Foundation; National Natural Science Foundation of China

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Different from universal object detection, referring expression comprehension (REC) aims to locate specific objects referred to by natural language expressions. The expression provides high-level concepts of relevant visual and contextual patterns, which vary significantly with different expressions and account for only a few of those encoded in the REC model. This leads us to a question: do we really need the entire network with a fixed structure for various referring expressions? Ideally, given an expression, only expression-relevant components of the REC model are required. These components should be small in number as each expression only contains very few visual and contextual clues. This paper explores the adaptation between expressions and REC models for dynamic inference. Concretely, we propose a neat yet efficient framework named Language Adaptive Dynamic Subnets (LADS), which can extract language-adaptive subnets from the REC model conditioned on the referring expressions. By using the compact subnet, the inference can be more economical and efficient. Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and Referit show that the proposed method achieves faster inference speed and higher accuracy against state-of-the-art approaches.
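
The abstract does not spell out how LADS selects a subnet, so the following is only a toy illustration of the general idea of expression-conditioned dynamic inference, not the authors' method. It assumes a hypothetical linear gate that scores each channel of one layer from an expression embedding, keeps the top-scoring fraction, and computes only those channels. All names (`gate_scores`, `select_subnet`, `dense_layer`, `keep_ratio`) and all numbers are made up for illustration.

```python
import math
import random


def gate_scores(expr_embedding, gate_weights):
    """Score every channel with a linear gate over the expression embedding."""
    return [sum(w * e for w, e in zip(row, expr_embedding)) for row in gate_weights]


def select_subnet(scores, keep_ratio):
    """Return the sorted indices of the top-scoring channels."""
    k = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def dense_layer(x, weights, active):
    """Compute only the output channels in `active`; the rest are skipped."""
    return {c: sum(w * v for w, v in zip(weights[c], x)) for c in active}


random.seed(0)
embed_dim, in_dim, channels = 4, 3, 8

# Randomly initialised stand-ins for learned parameters (assumption).
gate_weights = [[random.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(channels)]
layer_weights = [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(channels)]

expr_embedding = [0.2, -0.5, 0.9, 0.1]  # stand-in for an encoded referring expression
visual_feature = [1.0, 0.5, -0.3]       # stand-in for a visual feature vector

active = select_subnet(gate_scores(expr_embedding, gate_weights), keep_ratio=0.25)
outputs = dense_layer(visual_feature, layer_weights, active)
```

With `keep_ratio=0.25`, only 2 of the 8 channels are evaluated for this expression; a different expression embedding would generally activate a different pair, which is the sense in which inference cost can shrink with a compact, expression-specific subnet.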


Pages: 2357-2365

Page count: 9

