A Chinese Dataset for Evaluating the Safeguards in Large Language Models

Citations: 0
Authors
Wang, Yuxia [1 ,2 ]
Zhai, Zenan [1 ]
Li, Haonan [1 ,2 ]
Han, Xudong [1 ,2 ]
Lin, Lizhi [4 ,5 ]
Zhang, Zhenxuan [1 ]
Zhao, Jingru [5 ]
Nakov, Preslav [2 ]
Baldwin, Timothy [1 ,2 ,3 ]
Affiliations
[1] LibrAI, Abu Dhabi, U Arab Emirates
[2] MBZUAI, Abu Dhabi, U Arab Emirates
[3] Univ Melbourne, Melbourne, Vic, Australia
[4] Tsinghua Univ, Beijing, Peoples R China
[5] MiraclePlus, Beijing, Peoples R China
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024 | 2024
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks. Previous studies have proposed comprehensive taxonomies of LLM risks, as well as corresponding prompts that can be used to examine LLM safety. However, the focus has been almost exclusively on English. We aim to broaden LLM safety research by introducing a dataset for the safety evaluation of Chinese LLMs, and extending it to better identify false negative and false positive examples in terms of risky prompt rejections. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of LLM response harmfulness. Our experiments over five LLMs show that region-specific risks are the prevalent risk type. Warning: this paper contains example data that may be offensive, harmful, or biased.
Pages: 3106-3119
Page count: 14
Related Papers
33 references in total
[1]  
Bai Jinze, 2023, arXiv
[2]  
Deng Y, 2024, arXiv, DOI 10.48550/arXiv.2310.06474
[3]
Dhamala, Jwala; Sun, Tony; Kumar, Varun; Krishna, Satyapriya; Pruksachatkun, Yada; Chang, Kai-Wei; Gupta, Rahul. BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation. PROCEEDINGS OF THE 2021 ACM CONFERENCE ON FAIRNESS, ACCOUNTABILITY, AND TRANSPARENCY, FACCT 2021, 2021, P862-872
[4]  
Ding P, 2024, arXiv, DOI 10.48550/arXiv.2311.08268
[5]  
Gade P, 2024, arXiv, DOI 10.48550/arXiv.2311.00117
[6]  
Gehman S., 2020, FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: EMNLP 2020, P3356, DOI 10.18653/v1/2020.findings-emnlp.301
[7]  
Han XD, 2021, 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), P2760
[8]  
Hartvigsen T, 2022, PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), P3309
[9]  
Lapid R, 2024, arXiv, DOI 10.48550/arXiv.2309.01446
[10]  
Li X, 2024, arXiv, DOI 10.48550/arXiv.2311.03191