GalaxyGPT: A Hybrid Framework for Large Language Model Safety

Cited by: 0
Authors
Zhou, Hange [1 ]
Zheng, Jiabin [2 ]
Zhang, Longtu [1 ]
Affiliations
[1] Geely Automobile Res Inst Ningbo Co Ltd, Ningbo 315336, Peoples R China
[2] Zhejiang Univ, Sch Aeronaut & Astronaut, Hangzhou 310027, Peoples R China
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Artificial intelligence; content management; content moderation; ChatGPT; large language model; model safety; prompt engineering; supervised fine-tuning
DOI
10.1109/ACCESS.2024.3425662
Chinese Library Classification (CLC): TP [Automation technology; computer technology]
Discipline code: 0812
Abstract
The challenge of balancing safety and utility in Large Language Models (LLMs) requires novel solutions that go beyond conventional methods of pre- and post-processing, red-teaming, and feedback fine-tuning. In response, we introduce GalaxyGPT, a framework that combines the safety moderation services of Internet vendors with LLMs to enhance safety performance. This necessity arises from the growing complexity of online interactions and the imperative to ensure that LLMs operate within safe and ethical boundaries without compromising their utility. GalaxyGPT leverages advanced algorithms and a comprehensive dataset to significantly improve safety measures, achieving notable accuracy (95.8%) and F1-score (94.5%) in evaluations on our custom dataset comprising 500 single-round safety tests, 100 multi-round dialogue tests, and 200 open-source tests. These results substantially outperform the safety metrics of APIs from six vendors (average 40.5% accuracy) and of LLMs without GalaxyGPT integration (73% accuracy). Additionally, we contribute to the community by releasing an open-source test set of 600 entries and a compact classification model for security tasks, specifically designed to challenge and enhance the robustness of APIs, thereby facilitating the efficient deployment and application of GalaxyGPT in diverse environments.
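The abstract describes gating an LLM with vendor moderation services on both the input and output sides. A minimal sketch of such a moderation-gated pipeline is shown below; the keyword-based `vendor_moderate` stub and the `galaxy_pipeline` function are hypothetical stand-ins (a real deployment would call a vendor moderation API and an actual LLM), not the paper's implementation.

```python
# Hypothetical sketch of a moderation-gated LLM pipeline: both the user
# prompt and the model reply are screened before being passed through.

from typing import Callable

# Toy stand-in for a vendor content-moderation service.
UNSAFE_TERMS = {"bomb", "weapon"}

def vendor_moderate(text: str) -> bool:
    """Return True if the text is flagged as unsafe (keyword stub)."""
    lowered = text.lower()
    return any(term in lowered for term in UNSAFE_TERMS)

def galaxy_pipeline(prompt: str, llm: Callable[[str], str]) -> str:
    """Screen the prompt, query the LLM, then screen the reply."""
    if vendor_moderate(prompt):
        return "[blocked: unsafe prompt]"
    reply = llm(prompt)
    if vendor_moderate(reply):
        return "[blocked: unsafe response]"
    return reply
```

With a trivial echo model, `galaxy_pipeline("hello", lambda p: "Echo: " + p)` passes the text through, while a prompt containing a flagged term is rejected before the model is ever called.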
Pages: 94436-94451
Page count: 16