FELM: Benchmarking Factuality Evaluation of Large Language Models

Citations: 0
Authors
Chen, Shiqi [1,2]
Zhao, Yiran [3]
Zhang, Jinghan [2]
Chern, I-Chun [4]
Gao, Siyang [1]
Liu, Pengfei [5]
He, Junxian [2]
Affiliations
[1] City Univ Hong Kong, Hong Kong, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[3] Natl Univ Singapore, Singapore, Singapore
[4] Carnegie Mellon Univ, Pittsburgh, PA USA
[5] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Assessing the factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators that assess factuality themselves require suitable evaluation to gauge progress and foster advancement. This direction remains under-explored, posing substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated by LLMs and annotate factuality labels in a fine-grained manner. In contrast to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM covers factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which helps pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict the statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought prompting. Our findings reveal that while retrieval aids factuality evaluation, current LLMs still fall far short of faithfully detecting factual errors.
Pages: 22
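As a concrete illustration of the benchmark design described in the abstract, the sketch below shows one plausible way to represent a FELM-style record (segment-level factuality labels, predefined error types, reference links) and to score an evaluator's per-segment predictions with an F1 over erroneous segments. This is a minimal sketch under stated assumptions, not the authors' released code; the names (Segment, FelmExample, segment_f1, the error-type string) and the example data are hypothetical.

```python
# Illustrative sketch of a FELM-style record and a segment-level score.
# All field and function names are assumptions, not the official release.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    text: str
    is_factual: bool                 # gold fine-grained label
    error_type: str = ""             # predefined error type when not factual
    reference_links: List[str] = field(default_factory=list)

@dataclass
class FelmExample:
    prompt: str
    domain: str                      # e.g. "world_knowledge", "math", "reasoning"
    segments: List[Segment]

def segment_f1(gold: List[bool], pred: List[bool]) -> float:
    """F1 over the non-factual class: how well an evaluator pinpoints
    erroneous segments, given one True/False prediction per segment."""
    tp = sum(1 for g, p in zip(gold, pred) if not g and not p)
    fp = sum(1 for g, p in zip(gold, pred) if g and not p)
    fn = sum(1 for g, p in zip(gold, pred) if not g and p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Usage: an LLM-based evaluator (vanilla, retrieval-augmented, or
# chain-of-thought) would emit one prediction per annotated segment.
example = FelmExample(
    prompt="When was the Eiffel Tower completed?",
    domain="world_knowledge",
    segments=[
        Segment("The Eiffel Tower was completed in 1889.", True),
        Segment("It is located in Berlin.", False, "entity_error",
                ["https://en.wikipedia.org/wiki/Eiffel_Tower"]),
    ],
)
gold = [s.is_factual for s in example.segments]
pred = [True, False]  # hypothetical evaluator output
print(f"segment-level F1 on factual errors: {segment_f1(gold, pred):.2f}")
```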
Related Papers
50 items in total
  • [41] Factuality Enhanced Language Models for Open-Ended Text Generation
    Lee, Nayeon
    Ping, Wei
    Xu, Peng
    Patwary, Mostofa
    Fung, Pascale
    Shoeybi, Mohammad
    Catanzaro, Bryan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [42] Evaluating large language models on geospatial tasks: a multiple geospatial task benchmarking study
    Xu, Liuchang
    Zhao, Shuo
    Lin, Qingming
    Chen, Luyao
    Luo, Qianqian
    Wu, Sensen
    Ye, Xinyue
    Feng, Hailin
    Du, Zhenhong
    INTERNATIONAL JOURNAL OF DIGITAL EARTH, 2025, 18 (01)
  • [43] UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation
    Liang, Xun
    Song, Shichao
    Niu, Simin
    Li, Zhiyu
    Xiong, Feiyu
    Tang, Bo
    Wang, Yezhaohui
    He, Dawei
    Cheng, Peng
    Wang, Zhonghao
    Deng, Haiying
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 5266 - 5293
  • [44] RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models
    Wang, Zekun Moore
    Peng, Zhongyuan
    Que, Haoran
    Liu, Jiaheng
    Zhou, Wangchunshu
    Wu, Yuhan
    Guo, Hongcheng
    Gan, Ruitong
    Ni, Zehao
    Yang, Jian
    Zhang, Man
    Zhang, Zhaoxiang
    Ouyang, Wanli
    Xu, Ke
    Huang, Stephen W.
    Fu, Jie
    Peng, Junran
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 14743 - 14777
  • [45] PromptBench: A Unified Library for Evaluation of Large Language Models
    Zhu, Kaijie
    Zhao, Qinlin
    Chen, Hao
    Wang, Jindong
    Xie, Xing
    JOURNAL OF MACHINE LEARNING RESEARCH, 2024, 25 : 1 - 22
  • [46] Updating knowledge in Large Language Models: an Empirical Evaluation
    Marinelli, Alberto Roberto
    Carta, Antonio
    Passaro, Lucia C.
    IEEE CONFERENCE ON EVOLVING AND ADAPTIVE INTELLIGENT SYSTEMS 2024, IEEE EAIS 2024, 2024, : 289 - 296
  • [47] On the Evaluation of Large Language Models in Unit Test Generation
    Yang, Lin
    Yang, Chen
    Gao, Shutao
    Wang, Weijing
    Wang, Bo
    Zhu, Qihao
    Chu, Xiao
    Zhou, Jianyi
    Liang, Guangtai
    Wang, Qianxiang
    Chen, Junjie
    PROCEEDINGS OF THE 39TH ACM/IEEE INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE 2024, 2024, : 1607 - 1619
  • [48] A Comprehensive Evaluation of Quantization Strategies for Large Language Models
    Jin, Renren
    Du, Jiangcun
    Huang, Wuwei
    Liu, Wei
    Lu, Jian
    Wang, Bin
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 12186 - 12215
  • [49] Can Large Language Models Be an Alternative to Human Evaluation?
    Chiang, Cheng-Han
    Lee, Hung-yi
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 15607 - 15631
  • [50] Are large language models qualified reviewers in originality evaluation?
    Huang, Shengzhi
    Huang, Yong
    Liu, Yinpeng
    Luo, Zhuoran
    Lu, Wei
    INFORMATION PROCESSING & MANAGEMENT, 2025, 62 (03)