SELF-ENHANCED TRAINING FRAMEWORK FOR REFERRING EXPRESSION GROUNDING

Citations: 0
Authors
Chen, Yitao [1]
Du, Ruoyi [1]
Liang, Kongming [1]
Ma, Zhanyu [1]
Affiliations
[1] Beijing University of Posts and Telecommunications, School of Artificial Intelligence, Pattern Recognition and Intelligent System Lab, Beijing 100876, People's Republic of China
Source
2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP | 2023
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
Keywords
Referring expression grounding; weakly-supervised; fully-supervised; pseudo-label
DOI
10.1109/ICIP49359.2023.10222357
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Weakly-supervised referring expression grounding (REG) aims to locate the image region described by a query sentence, where the mapping between the referential region and the query is not available during the training stage. Noticing the significant gap between the fully- and weakly-supervised approaches, we develop a Self-Enhanced Training (SET) framework in this paper. Specifically, we first train the network under a weakly-supervised setting. Then, the model outputs are collected, filtered according to their confidence scores, and used as pseudo-labels. Finally, with the help of these pseudo-labels, we tune the model under a fully-supervised setting. The SET framework provides a simple way of generating pseudo-labels that build a bridge between weak and full supervision. Experimental results demonstrate that a model trained through our SET framework outperforms existing traditional methods on the RefCOCO, RefCOCO+, and RefCOCOg datasets. The code is available at https://github.com/HTDL98/SET-framework.
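The confidence-based filtering step in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Prediction`, `PseudoLabel`, and `select_pseudo_labels`, the threshold value of 0.9, and the choice to keep only the highest-scoring box per image-query pair are all assumptions made for illustration. The abstract states only that outputs are filtered by confidence score.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Prediction:
    """One output of the weakly-supervised model (hypothetical structure)."""
    image_id: str
    query: str
    box: Box
    score: float  # confidence, assumed to lie in [0, 1]


@dataclass
class PseudoLabel:
    """A region-query pair treated as ground truth in the second stage."""
    image_id: str
    query: str
    box: Box


def select_pseudo_labels(
    predictions: List[Prediction], threshold: float = 0.9
) -> List[PseudoLabel]:
    """Keep the top-scoring box per (image, query) if it clears the threshold.

    Both the threshold value and the per-pair top-1 rule are assumptions.
    """
    best: Dict[Tuple[str, str], Prediction] = {}
    for pred in predictions:
        key = (pred.image_id, pred.query)
        if key not in best or pred.score > best[key].score:
            best[key] = pred
    return [
        PseudoLabel(p.image_id, p.query, p.box)
        for p in best.values()
        if p.score >= threshold
    ]
```

Under these assumptions, the selected pairs would then be used as box-level supervision when tuning the model in the fully-supervised stage, while low-confidence pairs are left out of that stage.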
Pages: 3060-3064
Page count: 5