Can Large Language Models Be an Alternative to Human Evaluation?

Cited by: 0
Authors
Chiang, Cheng-Han [1 ]
Lee, Hung-yi [1 ]
Affiliations
[1] Natl Taiwan Univ, Taipei, Taiwan
Source
PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1 | 2023
Keywords: (none)
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce and its quality is notoriously unstable, hindering fair comparisons among different natural language processing (NLP) models and algorithms. Recently, large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. In this paper, we explore whether this ability of LLMs can serve as an alternative to human evaluation. We present the LLMs with the exact same instructions, samples to be evaluated, and questions used to conduct human evaluation, and then ask the LLMs to generate responses to those questions; we dub this LLM evaluation. We use human evaluation and LLM evaluation to evaluate the texts in two NLP tasks: open-ended story generation and adversarial attacks. We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation: the texts rated higher by human experts are also rated higher by the LLMs. We also find that the results of LLM evaluation are stable across different formattings of the task instructions and the sampling algorithms used to generate the answer. We are the first to show the potential of using LLMs to assess the quality of texts, and we discuss the limitations and ethical considerations of LLM evaluation.
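The protocol the abstract describes — show the model the same instructions, sample, and rating question a human annotator would see, then read a rating off its free-text reply — can be sketched minimally. The helper names (`build_eval_prompt`, `parse_rating`), the prompt wording, and the 1-5 Likert scale are illustrative assumptions, not details from the paper; the call to an actual LLM is omitted.

```python
import re
from typing import Optional

def build_eval_prompt(instructions: str, sample: str, question: str) -> str:
    """Assemble the same instructions, sample, and question shown to human raters."""
    return f"{instructions}\n\nText to evaluate:\n{sample}\n\n{question}"

def parse_rating(response: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Pull the first in-scale integer out of the model's free-text reply."""
    for token in re.findall(r"\d+", response):
        value = int(token)
        if low <= value <= high:
            return value
    return None

prompt = build_eval_prompt(
    "Rate the grammaticality of the story fragment on a 1-5 Likert scale.",
    "Once upon a time, a robot learned to write short stories.",
    "How grammatically correct is the text? (1 = lowest, 5 = highest)",
)
# `prompt` would be sent to an LLM; here we only parse a hypothetical reply.
print(parse_rating("I would rate this fragment a 4 out of 5."))  # prints 4
```

Because the prompt is identical to the one given to human raters, the resulting scores can be compared directly with the human Likert ratings, which is how the paper assesses consistency between the two.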
Pages: 15607-15631
Page count: 25