SUMMEDITS: Measuring LLM Ability at Factual Reasoning Through The Lens of Summarization

Times cited: 0
Authors
Laban, Philippe [1]
Kryscinski, Wojciech [1]
Agarwal, Divyansh [1]
Fabbri, Alexander R. [1]
Xiong, Caiming [1]
Joty, Shafiq [1]
Wu, Chien-Sheng [1]
Affiliations
[1] Salesforce AI, New York, NY 10036 USA
Source
2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023) | 2023
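Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract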
With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SUMMEDITS. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SUMMEDITS, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs' ability to reason about facts and detect inconsistencies when they occur.
Pages: 9662-9676
Number of pages: 15