Breaking the Barrier Between Pre-training and Fine-tuning: A Hybrid Prompting Model for Knowledge-Based VQA

Cited by: 2
Authors
Sun, Zhongfan [1 ]
Hu, Yongli [1 ]
Gao, Qingqing [1 ]
Jiang, Huajie [1 ]
Gao, Junbin [2 ]
Sun, Yanfeng [1 ]
Yin, Baocai [1 ]
Affiliations
[1] Beijing University of Technology, Beijing, China
[2] University of Sydney, Sydney, NSW, Australia
Source
Proceedings of the 31st ACM International Conference on Multimedia (MM 2023), 2023
Funding
National Key Research and Development Program of China
Keywords
Visual Question Answering; Knowledge Integration; Multi-modal Fusion;
DOI
10.1145/3581783.3612516
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Considerable performance gains have been achieved in knowledge-based visual question answering (VQA) by visual-language pre-training models under the pre-training-then-fine-tuning paradigm. However, because the objectives of the pre-training and fine-tuning stages differ, an evident barrier prevents the cross-modal comprehension ability developed during pre-training from fully benefiting the fine-tuning task. To break this barrier, we propose a novel hybrid prompting model for knowledge-based VQA that inherits and integrates the pre-training and fine-tuning tasks under a shared objective. Specifically, based on a static declaration prompt, we construct a goal consistent with fine-tuning via masked language modeling so as to inherit the capabilities of the pre-training task, while selecting the top-t relevant knowledge in a dense-retrieval manner. Additionally, a dynamic knowledge prompt is learned from the retrieved knowledge; it not only alleviates the input-length constraint of visual-language pre-trained models but also helps provide answer features during fine-tuning. Combining and unifying the aims of the two stages allows the model to fully exploit the abilities of both pre-training and fine-tuning when predicting answers. We evaluate the proposed model on the OKVQA dataset, and the results show that it outperforms state-of-the-art methods based on visual-language pre-training models by a noticeable margin and even exceeds the large-scale language model GPT-3, which demonstrates the benefits of the hybrid prompts and the advantage of unifying pre-training and fine-tuning.
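The abstract names two mechanisms: answer prediction via masked language modeling over a declaration prompt (so the fine-tuning objective matches the pre-training one), and dense retrieval of the top-t relevant knowledge. The sketch below is a minimal, text-only illustration of those two ideas, not the authors' implementation: it assumes a BERT-style masked language model stands in for the visual-language pre-training model, and the function names (retrieve_top_t, answer_by_mlm), the toy corpus, and the single-token answer candidates are all hypothetical.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

# Text-only stand-ins for the paper's visual-language pre-trained model.
enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
mlm_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

@torch.no_grad()
def embed(texts):
    # Mean-pooled token embeddings used as a simple dense-retrieval representation.
    batch = enc_tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state              # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (B, L, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # (B, H)

@torch.no_grad()
def retrieve_top_t(query, knowledge_corpus, t=5):
    # Select the top-t knowledge sentences by cosine similarity to the query.
    q = F.normalize(embed([query]), dim=-1)                  # (1, H)
    k = F.normalize(embed(knowledge_corpus), dim=-1)         # (N, H)
    scores = (q @ k.T).squeeze(0)                            # (N,)
    idx = scores.topk(min(t, len(knowledge_corpus))).indices.tolist()
    return [knowledge_corpus[i] for i in idx]

@torch.no_grad()
def answer_by_mlm(declaration, knowledge, candidates):
    # Score answer candidates (assumed to be single vocabulary tokens) at the
    # [MASK] slot of the declaration prompt, conditioned on retrieved knowledge.
    prompt = " ".join(knowledge) + " " + declaration
    batch = mlm_tok(prompt, return_tensors="pt", truncation=True)
    mask_pos = (batch["input_ids"][0] == mlm_tok.mask_token_id).nonzero().item()
    logits = mlm(**batch).logits[0, mask_pos]                # (vocab_size,)
    scores = {c: logits[mlm_tok.convert_tokens_to_ids(c)].item() for c in candidates}
    return max(scores, key=scores.get)

# Toy usage with an invented knowledge corpus and declaration prompt.
corpus = [
    "A fire hydrant supplies water to fire hoses.",
    "Umbrellas are used to stay dry in the rain.",
    "Giraffes are the tallest living land animals.",
]
facts = retrieve_top_t("What comes out of a fire hose?", corpus, t=1)
declaration = "The thing that comes out of a fire hose is [MASK]."
print(answer_by_mlm(declaration, facts, ["water", "rain", "sand"]))

In the paper's actual model, the retrieved knowledge is distilled into a learned dynamic knowledge prompt rather than concatenated as raw text as above, which is how the input-length constraint of visual-language pre-trained models is alleviated.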
Pages: 4065-4073
Page count: 9