Safety Analysis of Large Model Content Generation Based on Knowledge Editing

Cited by: 0
Authors
Wang M. [1]
Yao Y. [2]
Xi Z. [1]
Zhang J. [1]
Wang P. [1]
Xu Z. [1]
Zhang N. [1,2]
Affiliations
[1] School of Software Technology, Zhejiang University, Hangzhou
[2] College of Computer Science and Technology, Zhejiang University, Hangzhou
Source
Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2024, Vol. 61, No. 05
Keywords
content generation; dataset; defense; jailbreak prompt; knowledge editing; large language model; safety
DOI
10.7544/issn1000-1239.202330965
CLC Number
Subject Classification Number
Abstract
Although large language models (LLMs) have achieved remarkable success, they still face safety problems in practical applications and can easily be induced to generate toxic and harmful content under malicious prompting. Existing methods for mitigating the unsafe behavior of LLMs often demand significant computational resources and incur high costs for collecting safety data. Knowledge editing offers a novel approach that precisely constrains the model's behavior for specific inputs without retraining, saving substantial resources and providing a feasible new avenue for steering large models toward safe content generation. Nevertheless, existing datasets for mitigating the unsafe behavior of LLMs do not cover all unsafe scenarios, and their toxicity issues are difficult for the safety defenses of post-alignment LLMs to handle, which hinders further optimization of safety in such models. In light of these challenges, we introduce a new dataset, SafeGen, and propose a novel evaluation framework to analyze the potential of knowledge editing for optimizing the safe content generation of LLMs. Extensive experiments reveal that knowledge editing is broadly applicable for rectifying the unsafe behaviors exhibited by LLMs and that editing parameters can strengthen the internal safety beliefs of LLMs. However, the fluency of text generated after knowledge editing falls short of expectations, indicating the inherent difficulty of this task. We hope that our work provides insights for the large model safety community. © 2024 Science Press. All rights reserved.
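As a rough, hypothetical illustration of the locate-then-edit idea behind knowledge editing (a toy Python/NumPy sketch, not the method evaluated in the paper; the layer shape, key vector, and safe target vector below are invented for demonstration), a single weight matrix can receive a closed-form rank-one update so that one specific input key maps to a new target output while unrelated inputs are only minimally perturbed:

import numpy as np

def rank_one_edit(W, k, v_new):
    # Minimally perturb W so that the specific key k maps exactly to v_new:
    #   W' = W + (v_new - W k) k^T / (k^T k)
    # Inputs orthogonal to k are left unchanged.
    residual = v_new - W @ k
    return W + np.outer(residual, k) / (k @ k)

# Hypothetical toy dimensions standing in for one MLP projection inside an LLM.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))     # original layer weights
k = rng.normal(size=16)          # key representing an unsafe prompt
v_safe = rng.normal(size=8)      # value encoding the desired safe response

W_edited = rank_one_edit(W, k, v_safe)
print(np.allclose(W_edited @ k, v_safe))         # True: the edited key now hits the target
k_other = rng.normal(size=16)
print(np.linalg.norm((W_edited - W) @ k_other))  # perturbation for an unrelated input, scaled by |k.k_other| / (k.k)

In practice, locate-then-edit methods such as ROME first identify which layer stores the relevant association and derive the key and value vectors from the prompt and the desired completion; the sketch above only shows the update rule itself, while the paper evaluates how well such edits suppress unsafe outputs and whether generation fluency is preserved.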
Pages: 1143-1155
Number of pages: 12