Safety Analysis of Large Model Content Generation Based on Knowledge Editing

Cited by: 0
Authors
Wang M. [1 ]
Yao Y. [2 ]
Xi Z. [1 ]
Zhang J. [1 ]
Wang P. [1 ]
Xu Z. [1 ]
Zhang N. [1 ,2 ]
Affiliations
[1] School of Software Technology, Zhejiang University, Hangzhou
[2] College of Computer Science and Technology, Zhejiang University, Hangzhou
Source
Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2024 / Vol. 61 / No. 05
Keywords
content generation; dataset; defense; jailbreak prompt; knowledge editing; large language model; safety;
DOI
10.7544/issn1000-1239.202330965
CLC Number
Subject Classification Code
Abstract
Although large language models (LLMs) have achieved remarkable success, they still face safety problems in practical applications and can easily be induced by malicious prompts to generate toxic and harmful content. Existing methods for mitigating such unsafe behavior often demand significant computational resources and incur high costs for collecting safe data. Knowledge editing offers a novel approach that precisely constrains a model's behavior on specific inputs without retraining, saving substantial resources and providing a feasible new avenue for steering large models toward safe content generation. Nevertheless, existing datasets for mitigating unsafe LLM behavior do not cover all unsafe scenarios, and their toxic prompts can rarely break through the safety defenses of post-alignment LLMs, which hinders further optimization of the safety of such models. In light of these challenges, we introduce a new dataset, SafeGen, and propose a novel evaluation framework to analyze the potential of knowledge editing for optimizing the safe content generation of LLMs. Extensive experiments reveal that knowledge editing is broadly applicable to rectifying unsafe behaviors exhibited by LLMs, and that editing parameters can strengthen the internal safety beliefs of LLMs. However, the fluency of text generated after knowledge editing falls short of expectations, indicating the inherent difficulty of the task. We hope that our work provides insights for the large model safety community. © 2024 Science Press. All rights reserved.
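The abstract describes knowledge editing as a way to constrain a model's behavior on specific inputs without retraining. As a rough illustration of that general idea only (not the SafeGen framework or the paper's actual method), the sketch below applies a ROME-style closed-form rank-one update to a toy weight matrix so that one "unsafe" key vector is remapped to a "safe" value vector; all names, shapes, and data here are illustrative assumptions.

```python
"""Minimal sketch of a locate-then-edit rank-one parameter update,
in the spirit of ROME-style knowledge editing. Toy data only; this is
an assumption-laden illustration, not the SafeGen pipeline."""
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 64

# Toy stand-in for one MLP projection matrix inside a transformer layer.
W = rng.normal(scale=0.02, size=(d_out, d_in))

# Covariance of "ordinary" keys estimated from unrelated inputs; the
# C^{-1} weighting below limits disturbance along those common directions.
K = rng.normal(size=(d_in, 1000))
C = K @ K.T / K.shape[1]
C_inv = np.linalg.inv(C + 1e-4 * np.eye(d_in))

# k_star: hidden representation of the unsafe prompt to be constrained.
# v_star: hidden representation intended to steer toward a safe refusal.
k_star = rng.normal(size=d_in)
v_star = rng.normal(size=d_out)

# Closed-form rank-one update: forces W_new @ k_star == v_star while
# leaving W unchanged up to a single rank-one correction.
residual = v_star - W @ k_star
u = C_inv @ k_star
W_new = W + np.outer(residual, u) / (k_star @ u)

print("edit error on target key:", np.linalg.norm(W_new @ k_star - v_star))  # ~0
print("weight drift on a random key:",
      np.linalg.norm((W_new - W) @ rng.normal(size=d_in)))
```

In this toy setting the edited matrix reproduces the desired "safe" value exactly for the targeted key, while the change applied to other keys is a single rank-one perturbation, which is the basic trade-off the abstract's evaluation framework probes at the scale of real LLMs.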
Pages: 1143-1155
Number of pages: 12