Is the Answer in the Text? Challenging ChatGPT with Evidence Retrieval from Instructive Text

Cited: 0
Authors
Henning, Sophie [1 ,2 ]
Anthonio, Talita [1 ,3 ]
Zhou, Wei [1 ,4 ]
Adel, Heike [5 ]
Mesgar, Mohsen [1 ]
Friedrich, Annemarie [4 ]
Affiliations
[1] Bosch Ctr Artificial Intelligence, Renningen, Germany
[2] Ludwig Maximilian Univ Munich, Munich, Germany
[3] Univ Stuttgart, Stuttgart, Germany
[4] Univ Augsburg, Augsburg, Germany
[5] Hsch Medien Stuttgart, Stuttgart, Germany
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023) | 2023
Keywords
AGREEMENT;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Generative language models have recently shown remarkable success in generating answers to questions in a given textual context. However, these answers may contain hallucinations, cite evidence incorrectly, and spread misleading information. In this work, we address this problem by employing ChatGPT, a state-of-the-art generative model, as a machine-reading system. We ask it to retrieve answers to lexically varied and open-ended questions from trustworthy instructive texts. We introduce WHERE (WikiHow Evidence REtrieval), a new high-quality evaluation benchmark: a set of WikiHow articles exhaustively annotated with evidence sentences for questions. The benchmark poses a special challenge: all questions are about the article's topic, but not all can be answered using the provided context. Interestingly, we find that when using a regular question-answering prompt, ChatGPT fails to detect the unanswerable cases. When provided with a few examples, it learns to better judge whether a text provides answer evidence. Alongside this important finding, our dataset defines a new benchmark for evidence retrieval in question answering, which we argue is one of the necessary next steps for making large language models more trustworthy.
Pages: 14229-14241
Page count: 13