Prompt-Enhanced Generation for Multimodal Open Question Answering

Cited by: 0
Authors
Cui, Chenhao [1]
Li, Zhoujun [2]
Affiliations
[1] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, Beijing 100191, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
multimodal question answering; retrieval augmented generation; prompt learning; vision-language alignment
DOI
10.3390/electronics13081434
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Multimodal open question answering involves retrieving relevant information from images and their corresponding texts given a question, and then generating the answer. The quality of the generated answer depends heavily on the quality of the retrieved image-text pairs. Existing methods encode and retrieve images and texts, then feed the retrieved results into a language model to generate answers. These methods overlook the semantic alignment of image-text pairs within the information source, which degrades encoding and retrieval performance. They are also highly dependent on retrieval performance, so poor retrieval quality can lead to poor generation performance. To address these issues, we propose a prompt-enhanced generation model, PEG. PEG generates supplementary descriptions for images to provide ample material for image-text alignment, and uses vision-language joint encoding to improve encoding quality and thereby retrieval performance. Contrastive learning is used to enhance the model's ability to discriminate between relevant and irrelevant information sources. In addition, we use prefix-tuning to explore the knowledge stored in pre-trained model parameters and generate background knowledge relevant to the question, which offers additional input for answer generation and reduces the model's dependency on retrieval performance. Experiments on the WebQA and MultimodalQA datasets demonstrate that our model outperforms the baseline models in retrieval and generation performance.
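The abstract does not state the exact contrastive objective PEG uses. As a rough illustration only, the sketch below shows a generic InfoNCE-style loss, a common choice for training a retriever to separate relevant from irrelevant sources. The function name `info_nce_loss`, the cosine similarity, and the temperature value are assumptions for illustration, not details taken from the paper.

```python
import math


def info_nce_loss(query, positive, negatives, temperature=0.07):
    """Generic InfoNCE-style loss for a single question (illustrative only).

    query, positive: embedding vectors as lists of floats.
    negatives: list of embedding vectors for irrelevant sources.
    The loss is small when the query is closer to the positive source
    than to every negative source.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Temperature-scaled similarities; index 0 is the relevant source.
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]

    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))

    # Negative log-probability of picking the relevant source.
    return log_denominator - logits[0]


# Hypothetical 2-d embeddings: the question matches the first source.
aligned = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned < misaligned)
```

In a real retriever the embeddings would come from the joint vision-language encoder and the loss would be averaged over a batch; both are outside the scope of this sketch.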
Pages: 17