RETR: END-TO-END REFERRING EXPRESSION COMPREHENSION WITH TRANSFORMERS

Cited by: 0

Authors:

Rui, Yang [1]

Affiliations:

[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611931, Peoples R China

Source:

2022 19TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP) | 2022

Keywords:

Referring expression comprehension; Object detection; Multi-modal fusion; Transformers;

DOI:

10.1109/ICCWAMTIP56608.2022.10016599

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

Referring Expression Comprehension (REC) is a fundamental and challenging task: given a language expression, identify the region it refers to. However, existing two-stage and one-stage methods suffer from their reliance on region proposals, a limited range of visual context, and incomplete cross-modal alignment. To address these problems, we propose a simple yet effective one-stage model, termed REC TRansformer (RETR), which is trained end-to-end. Unlike manually designed multi-modal fusion modules, RETR adopts a transformer decoder with alternately stacked self-attention and cross-attention layers to capture global visual context and establish detailed visual-linguistic correspondence. Moreover, we use multiple learnable tokens to obtain diverse yet complementary region representations for accurate prediction. Extensive experiments on four datasets show that RETR achieves state-of-the-art performance.
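The decoder idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head attention without learned projections, assumes the tokens cross-attend to a simple concatenation of visual and text features, and assumes the token outputs are averaged into one region representation. The names `attend`, `decode`, and `fuse` are hypothetical, and the random inputs stand in for learned tokens and encoder features.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention with a residual connection.

    Simplification (assumption): no learned projections or layer norm.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        ctx = [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(d)]
        out.append([qi + ci for qi, ci in zip(q, ctx)])
    return out


def decode(tokens, visual, text, num_layers=2):
    """Alternate self-attention and cross-attention over the tokens."""
    memory = visual + text  # assumption: concatenated visual and text features
    for _ in range(num_layers):
        tokens = attend(tokens, tokens, tokens)  # self-attention among tokens
        tokens = attend(tokens, memory, memory)  # cross-attention to image and text
    return tokens


def fuse(tokens):
    """Average the token outputs (assumption; the abstract does not say how they are combined)."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]


random.seed(0)
d = 4
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(3)]  # stand-in for learnable tokens
visual = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]  # stand-in for image features
text = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]    # stand-in for word features
refined = decode(tokens, visual, text)
region = fuse(refined)  # one vector that a box-regression head could consume
```

In a trained model the tokens would be parameters and each attention layer would carry its own learned weights; the sketch only shows the data flow of the alternating layers.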

References

Pages: 5

8 entries in total

[1] Carion Nicolas, 2020, Computer Vision - ECCV 2020. 16th European Conference. Proceedings. Lecture Notes in Computer Science (LNCS 12346), P213, DOI 10.1007/978-3-030-58452-8_13

[2] TransVG: End-to-End Visual Grounding with Transformers. Deng, Jiajun; Yang, Zhengyuan; Chen, Tianlang; Zhou, Wengang; Li, Houqiang. 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021: 1749-1759

[3] Learning to Assemble Neural Module Tree Networks for Visual Grounding. Liu, Daqing; Zhang, Hanwang; Wu, Feng; Zha, Zheng-Jun. 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), 2019: 4672-4681

[4] Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. Rezatofighi, Hamid; Tsoi, Nathan; Gwak, JunYoung; Sadeghian, Amir; Reid, Ian; Savarese, Silvio. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 658-666

[5] Dynamic Graph Attention for Referring Expression Comprehension. Yang, Sibei; Li, Guanbin; Yu, Yizhou. 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), 2019: 4643-4652

[6] A Fast and Accurate One-Stage Approach to Visual Grounding. Yang, Zhengyuan; Gong, Boqing; Wang, Liwei; Huang, Wenbing; Yu, Dong; Luo, Jiebo. 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), 2019: 4682-4692

[7] MAttNet: Modular Attention Network for Referring Expression Comprehension. Yu, Licheng; Lin, Zhe; Shen, Xiaohui; Yang, Jimei; Lu, Xin; Bansal, Mohit; Berg, Tamara L. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 1307-1315

[8] Zhengyuan Yang, 2020, Computer Vision - ECCV 2020. 16th European Conference. Proceedings. Lecture Notes in Computer Science (LNCS 12359), P387, DOI 10.1007/978-3-030-58568-6_23
