FELM: Benchmarking Factuality Evaluation of Large Language Models

Citations: 0

Authors:

Chen, Shiqi [1,2]

Zhao, Yiran [3]

Zhang, Jinghan [2]

Chern, I-Chun [4]

Gao, Siyang [1]

Liu, Pengfei [5]

He, Junxian [2]

Affiliations:

[1] City Univ Hong Kong, Hong Kong, Peoples R China

[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China

[3] Natl Univ Singapore, Singapore, Singapore

[4] Carnegie Mellon Univ, Pittsburgh, PA USA

[5] Shanghai Jiao Tong Univ, Shanghai, Peoples R China

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Assessing factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators assessing factuality necessitate suitable evaluation themselves to gauge progress and foster advancements. This direction remains under-explored, resulting in substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated from LLMs and annotate factuality labels in a fine-grained manner. Contrary to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM focuses on factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which can help pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict the statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory to faithfully detect factual errors.


Pages: 22
