Breaking the Barrier Between Pre-training and Fine-tuning: A Hybrid Prompting Model for Knowledge-Based VQA

Cited by: 2
Authors
Sun, Zhongfan [1 ]
Hu, Yongli [1 ]
Gao, Qingqing [1 ]
Jiang, Huajie [1 ]
Gao, Junbin [2 ]
Sun, Yanfeng [1 ]
Yin, Baocai [1 ]
Affiliations
[1] Beijing University of Technology, Beijing, China
[2] University of Sydney, Sydney, NSW, Australia
Source
Proceedings of the 31st ACM International Conference on Multimedia (MM 2023), 2023
Funding
National Key Research and Development Program of China
Keywords
Visual Question Answering; Knowledge Integration; Multi-modal Fusion;
DOI
10.1145/3581783.3612516
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Considerable performance gains have been achieved in knowledge-based visual question answering (VQA) by visual-language pre-training models under the pre-training-then-fine-tuning paradigm. However, because the objectives of the pre-training and fine-tuning stages differ, an evident barrier prevents the cross-modal comprehension ability developed during pre-training from fully benefiting the fine-tuning task. To break this barrier, we propose a novel hybrid prompting model for knowledge-based VQA that inherits and integrates the pre-training and fine-tuning tasks under a shared objective. Specifically, based on a static declaration prompt, we construct a goal consistent with fine-tuning via masked language modeling so as to inherit the capabilities of the pre-training task, while selecting the top-t relevant knowledge in a dense-retrieval manner. Additionally, a dynamic knowledge prompt is learned from the retrieved knowledge; it not only alleviates the input-length constraint of visual-language pre-trained models but also helps provide answer features during fine-tuning. Combining and unifying the aims of the two stages allows the model to fully exploit the abilities of both pre-training and fine-tuning when predicting answers. We evaluate the proposed model on the OKVQA dataset, and the results show that it outperforms state-of-the-art methods based on visual-language pre-training models by a noticeable margin and even exceeds the large-scale language model GPT-3, which demonstrates the benefits of the hybrid prompts and the advantage of unifying pre-training and fine-tuning.
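The abstract names two mechanisms: answer prediction via masked language modeling over a declaration prompt (so the fine-tuning objective matches the pre-training one), and dense retrieval of the top-t relevant knowledge. The sketch below is a minimal, text-only illustration of those two ideas, not the authors' implementation: it assumes a BERT-style masked language model stands in for the visual-language pre-training model, and the function names (retrieve_top_t, answer_by_mlm), the toy corpus, and the single-token answer candidates are all hypothetical.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

# Text-only stand-ins for the paper's visual-language pre-trained model.
enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
mlm_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

@torch.no_grad()
def embed(texts):
    # Mean-pooled token embeddings used as a simple dense-retrieval representation.
    batch = enc_tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state              # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (B, L, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # (B, H)

@torch.no_grad()
def retrieve_top_t(query, knowledge_corpus, t=5):
    # Select the top-t knowledge sentences by cosine similarity to the query.
    q = F.normalize(embed([query]), dim=-1)                  # (1, H)
    k = F.normalize(embed(knowledge_corpus), dim=-1)         # (N, H)
    scores = (q @ k.T).squeeze(0)                            # (N,)
    idx = scores.topk(min(t, len(knowledge_corpus))).indices.tolist()
    return [knowledge_corpus[i] for i in idx]

@torch.no_grad()
def answer_by_mlm(declaration, knowledge, candidates):
    # Score answer candidates (assumed to be single vocabulary tokens) at the
    # [MASK] slot of the declaration prompt, conditioned on retrieved knowledge.
    prompt = " ".join(knowledge) + " " + declaration
    batch = mlm_tok(prompt, return_tensors="pt", truncation=True)
    mask_pos = (batch["input_ids"][0] == mlm_tok.mask_token_id).nonzero().item()
    logits = mlm(**batch).logits[0, mask_pos]                # (vocab_size,)
    scores = {c: logits[mlm_tok.convert_tokens_to_ids(c)].item() for c in candidates}
    return max(scores, key=scores.get)

# Toy usage with an invented knowledge corpus and declaration prompt.
corpus = [
    "A fire hydrant supplies water to fire hoses.",
    "Umbrellas are used to stay dry in the rain.",
    "Giraffes are the tallest living land animals.",
]
facts = retrieve_top_t("What comes out of a fire hose?", corpus, t=1)
declaration = "The thing that comes out of a fire hose is [MASK]."
print(answer_by_mlm(declaration, facts, ["water", "rain", "sand"]))

In the paper's actual model, the retrieved knowledge is distilled into a learned dynamic knowledge prompt rather than concatenated as raw text as above, which is how the input-length constraint of visual-language pre-trained models is alleviated.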
Pages: 4065-4073
Page count: 9