JAILBREAK ANTIDOTE: RUNTIME SAFETY-UTILITY BALANCE VIA SPARSE REPRESENTATION ADJUSTMENT IN LARGE LANGUAGE MODELS

Cited by: 0
Authors
Shen, Guobin [1, 2, 3, 4]
Zhao, Dongcheng [1, 2, 3]
Dong, Yiting [1, 2, 3, 4]
He, Xiang [1, 2, 3]
Zeng, Yi [1, 2, 3, 4]
Affiliations
[1] Brain-inspired Cognitive Intelligence Lab., Institute of Automation, Chinese Academy of Sciences, China
[2] Beijing Institute of AI Safety and Governance, China
[3] Center for Long-term Artificial Intelligence, China
[4] School of Future Technology, University of Chinese Academy of Sciences, China
Source
arXiv
Keywords
Compendex
DOI
Not available
Related papers
1 item in total
  • [1] Accelerating Sparse Autoencoder Training via Layer-Wise Transfer Learning in Large Language Models
    Ghilardi, Davide
    Belotti, Federico
    Molinari, Marco
    Lim, Jaehyuk
    BlackboxNLP 2024 - 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP - Proceedings of the Workshop, 2024, pp. 530-550