Is the Answer in the Text? Challenging ChatGPT with Evidence Retrieval from Instructive Text

Cited: 0
Authors
Henning, Sophie [1 ,2 ]
Anthonio, Talita [1 ,3 ]
Zhou, Wei [1 ,4 ]
Adel, Heike [5 ]
Mesgar, Mohsen [1 ]
Friedrich, Annemarie [4 ]
Affiliations
[1] Bosch Ctr Artificial Intelligence, Renningen, Germany
[2] Ludwig Maximilian Univ Munich, Munich, Germany
[3] Univ Stuttgart, Stuttgart, Germany
[4] Univ Augsburg, Augsburg, Germany
[5] Hsch Medien Stuttgart, Stuttgart, Germany
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023) | 2023
Keywords
AGREEMENT;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Generative language models have recently shown remarkable success in generating answers to questions in a given textual context. However, these answers may contain hallucinations, cite evidence incorrectly, and spread misleading information. In this work, we address this problem by employing ChatGPT, a state-of-the-art generative model, as a machine-reading system. We ask it to retrieve answers to lexically varied and open-ended questions from trustworthy instructive texts. We introduce WHERE (WikiHow Evidence REtrieval), a new high-quality evaluation benchmark: a set of WikiHow articles exhaustively annotated with evidence sentences for questions. The benchmark poses a special challenge: all questions are about the article's topic, but not all can be answered using the provided context. Interestingly, we find that when using a regular question-answering prompt, ChatGPT fails to detect the unanswerable cases. When provided with a few examples, it learns to better judge whether a text provides answer evidence. Alongside this important finding, our dataset defines a new benchmark for evidence retrieval in question answering, which we argue is one of the necessary next steps for making large language models more trustworthy.
Pages: 14229-14241
Page count: 13