Evaluating Open-Domain Question Answering in the Era of Large Language Models

Cited: 0
Authors
Kamalloo, Ehsan [1 ,2 ]
Dziri, Nouha [3 ]
Clarke, Charles L. A. [2 ]
Rafiei, Davood [1 ]
Affiliations
[1] Univ Alberta, Edmonton, AB, Canada
[2] Univ Waterloo, Waterloo, ON, Canada
[3] Allen Inst Artificial Intelligence, Seattle, WA USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Code
081104; 0812; 0835; 1405;
Abstract
Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-OPEN, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly 60%, making it on par with existing top models, and the InstructGPT (few-shot) model actually achieves a new state-of-the-art on NQ-OPEN. We also find that more than 50% of lexical matching failures are attributed to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistently with human judgments, although it still suffers from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle to detect hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.
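For illustration, a minimal sketch (in Python) of the normalized exact-match procedure that "lexical matching" typically refers to on NQ-OPEN; the normalization steps here (lowercasing, dropping punctuation and articles) follow the common SQuAD-style convention and are assumptions for this sketch, not code taken from the paper.

    import re
    import string

    def normalize(text: str) -> str:
        # Lowercase, remove punctuation and English articles, collapse whitespace (SQuAD-style).
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def exact_match(candidate: str, gold_answers: list) -> bool:
        # Lexical matching: correct only if the normalized candidate equals some
        # normalized gold answer; a semantically equivalent paraphrase scores 0,
        # which is the failure mode the abstract describes for long LLM answers.
        return any(normalize(candidate) == normalize(g) for g in gold_answers)

    gold = ["Barack Obama"]
    print(exact_match("barack obama", gold))                         # True
    print(exact_match("Barack Obama, the 44th US president", gold))  # False, despite being correct

This strictness is why a longer, semantically correct generative answer can still be scored as wrong under lexical matching.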
Pages: 5591-5606
Page count: 16
Related Papers (50 in total)
  • [1] Open-Domain Question Answering over Tables with Large Language Models
    Liang, Xinyi
    Hu, Rui
    Liu, Yu
    Zhu, Konglin
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT XII, ICIC 2024, 2024, 14873 : 347 - 358
  • [2] Learning Strategies for Open-Domain Natural Language Question Answering
    Grois, Eugene
    Wilkins, David C.
    19TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI-05), 2005, : 1054 - 1060
  • [3] Advances in open-domain question answering
    Zhang, Zhi-Chang
    Zhang, Yu
    Liu, Ting
    Li, Sheng
    Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2009, 37 (05): : 1058 - 1069
  • [4] MOQAGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Models
    Zhang, Le
    Wu, Yihong
    Mo, Fengran
    Nie, Jian-Yun
    Agrawal, Aishwarya
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 1195 - 1210
  • [5] Advances in question classification for open-domain question answering
    School of Computer Science and Technology, Anhui University of Technology, Maanshan, Anhui 243002, China; [second affiliation not specified], Jiangsu 210023, China
    Tien Tzu Hsueh Pao/Acta Electronica Sinica, 8: 1627-1636
  • [6] Type checking in open-domain question answering
    Schlobach, S
    Olsthoorn, M
    de Rijke, M
    ECAI 2004: 16TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2004, 110 : 398 - 402
  • [7] Detrimental Contexts in Open-Domain Question Answering
    Oh, Philhoon
    Thorne, James
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 11589 - 11605
  • [8] Ranking and Sampling in Open-Domain Question Answering
    Xu, Yanfu
    Lin, Zheng
    Liu, Yuanxin
    Liu, Rui
    Wang, Weiping
    Meng, Dan
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 2412 - 2421
  • [9] Passage filtering for open-domain Question Answering
    Noguera, Elisa
    Llopis, Fernando
    Ferrandez, Antonio
    ADVANCES IN NATURAL LANGUAGE PROCESSING, PROCEEDINGS, 2006, 4139 : 534 - 540
  • [10] Open-domain textual question answering techniques
    Harabagiu, Sanda M.
    Maiorano, Steven J.
    Paşca, Marius A.
    Natural Language Engineering, 2003, 9 (03) : 231 - 267