UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation

Cited: 0
Authors
Liang, Xun [1 ]
Song, Shichao [1 ]
Niu, Simin [1 ]
Li, Zhiyu [2 ]
Xiong, Feiyu [2 ]
Tang, Bo [2 ]
Wang, Yezhaohui [2 ]
He, Dawei [3 ]
Cheng, Peng [3 ]
Wang, Zhonghao [3 ]
Deng, Haiying [3 ]
Affiliations
[1] Renmin University of China, School of Information, Beijing, People's Republic of China
[2] Institute for Advanced Algorithms Research, Shanghai, People's Republic of China
[3] State Key Laboratory of Media Convergence Production Technology and Systems, Beijing, People's Republic of China
Source
PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS | 2024
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Large language models (LLMs) produce hallucinated text, compromising their practical utility in professional contexts. To assess the reliability of LLMs, numerous initiatives have developed benchmark evaluations for hallucination phenomena. However, owing to cost and time constraints, these benchmarks often rely on constrained generation techniques to produce the evaluation dataset, for instance, directed hallucination induction or deliberate modification of authentic text to introduce hallucinations. Such techniques are not congruent with the unrestricted text generation demanded by real-world applications. Furthermore, a well-established Chinese-language dataset dedicated to the evaluation of hallucinations is presently lacking. Consequently, we have developed UHGEval, an Unconstrained Hallucination Generation Evaluation benchmark containing hallucinations generated by LLMs with minimal restrictions. Concurrently, we have established a comprehensive benchmark evaluation framework to aid subsequent researchers in undertaking scalable and reproducible experiments. We have also evaluated prominent Chinese LLMs and the GPT series models to derive insights regarding hallucination.
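The abstract describes the evaluation framework only at a high level. As a rough illustration of one common protocol in hallucination benchmarks, the Python sketch below scores a model on a discriminative task: given a news beginning and a candidate continuation, the model under test answers whether the continuation is hallucinated, and accuracy is computed against gold labels. The JSONL layout, the field names begin/continuation/hallucinated, the file name uhgeval_sample.jsonl, and the query_llm helper are all hypothetical placeholders for illustration, not UHGEval's actual data schema or API.

    import json

    def query_llm(prompt: str) -> str:
        # Placeholder: replace with a call to the model under evaluation.
        # Returning "no" keeps the script runnable as a trivial baseline.
        return "no"

    def discriminative_accuracy(dataset_path: str) -> float:
        """Fraction of examples where the model's yes/no judgment
        matches the gold hallucination label."""
        correct, total = 0, 0
        with open(dataset_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)  # one labeled example per JSONL line
                prompt = (
                    "News beginning: " + record["begin"] + "\n"
                    "Continuation: " + record["continuation"] + "\n"
                    "Does the continuation contain hallucinated content? "
                    "Answer yes or no."
                )
                answer = query_llm(prompt).strip().lower()
                predicted_hallucinated = answer.startswith("yes")
                correct += int(predicted_hallucinated == record["hallucinated"])
                total += 1
        return correct / total if total else 0.0

    if __name__ == "__main__":
        acc = discriminative_accuracy("uhgeval_sample.jsonl")
        print(f"Discriminative accuracy: {acc:.3f}")

A framework structured this way keeps the dataset, the model client, and the scoring rule separable, which is what makes experiments of the kind the abstract mentions scalable and reproducible across models.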
Pages: 5266-5293
Page count: 28