Training Question Answering Models From Synthetic Data

Authors
Puri, Raul [1]
Spring, Ryan [2]
Shoeybi, Mohammad [1]
Patwary, Mostofa [1]
Catanzaro, Bryan [1]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Rice Univ, Houston, TX USA
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Question and answer generation is a data augmentation method that aims to improve question answering (QA) models given the limited amount of human-labeled data. However, a considerable gap remains between synthetic and human-generated question-answer pairs. This work aims to narrow this gap by taking advantage of large language models, and it explores several factors such as model size, quality of pretrained models, scale of data synthesized, and algorithmic choices. On the SQuAD1.1 question answering task, we achieve higher accuracy using solely synthetic questions and answers than when using the SQuAD1.1 training set questions alone. Without access to real Wikipedia data, we synthesize questions and answers from a synthetic text corpus generated by an 8.3 billion parameter GPT-2 model and achieve 88.4 Exact Match (EM) and 93.9 F1 score on the SQuAD1.1 dev set. We further apply our methodology to SQuAD2.0 and show a 2.8 absolute gain on EM score compared to prior work using synthetic data.
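The abstract reports results as Exact Match (EM) and F1. As a rough illustration of what these two numbers measure, the sketch below computes both for a single prediction against a single reference answer. It follows the answer normalization commonly used for SQuAD-style evaluation (lowercasing, stripping punctuation and English articles, collapsing whitespace); this is an assumption for illustration and is not taken from the paper. The function names `normalize_answer`, `exact_match`, and `f1_score` are hypothetical.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score in this style would typically average the per-question values (taking the best match over multiple reference answers where available) and report them as percentages.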
Pages: 5811-5826
Page count: 16