LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations

Cited by: 15
Authors
Tony, Catherine [1 ]
Mutas, Markus [1 ]
Ferreyra, Nicolas E. Diaz [1 ]
Scandariato, Riccardo [1 ]
Affiliations
[1] Hamburg University of Technology, Institute of Software Security, Hamburg, Germany
Source
2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), 2023
Keywords
LLMs; code security; NL prompts; CWE; KAPPA
DOI
10.1109/MSR59073.2023.00084
Chinese Library Classification (CLC)
TP31 [Computer Software]
Discipline Codes
081202; 0835
Abstract
Large Language Models (LLMs) like Codex are powerful tools for performing code completion and code generation tasks, as they are trained on billions of lines of code from publicly available sources. Moreover, these models are capable of generating code snippets from Natural Language (NL) descriptions by learning languages and programming practices from public GitHub repositories. Although LLMs promise an effortless NL-driven deployment of software applications, the security of the code they generate has not been extensively investigated or documented. In this work, we present LLMSecEval, a dataset containing 150 NL prompts that can be leveraged for assessing the security performance of such models. These prompts are NL descriptions of code snippets prone to various security vulnerabilities listed in MITRE's Top 25 Common Weakness Enumeration (CWE) ranking. Each prompt in our dataset comes with a secure implementation example to facilitate comparative evaluations against code produced by LLMs. As a practical application, we show how LLMSecEval can be used for evaluating the security of snippets automatically generated from NL descriptions.
Pages: 588-592
Page count: 5
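
To illustrate the evaluation workflow described in the abstract, the following minimal Python sketch iterates over the dataset's NL prompts, queries a code-generation model, and pairs each generated snippet with the dataset's secure reference implementation for later security review. The field names (prompt, cwe, secure_example) and the generate_code stub are hypothetical placeholders for illustration only, not the published LLMSecEval schema or any real model API.

import json

def generate_code(nl_prompt: str) -> str:
    # Hypothetical stand-in for a call to a code-generation LLM
    # (e.g., through an API client); returns a generated snippet.
    raise NotImplementedError("plug in your model client here")

def run_evaluation(dataset_path: str, output_path: str) -> None:
    # Load the prompt dataset; the field names used below are
    # assumptions, not the published LLMSecEval schema.
    with open(dataset_path, encoding="utf-8") as f:
        entries = json.load(f)

    results = []
    for entry in entries:
        generated = generate_code(entry["prompt"])
        results.append({
            "cwe": entry["cwe"],                       # targeted weakness, e.g. "CWE-79"
            "prompt": entry["prompt"],                 # NL description of the snippet
            "generated": generated,                    # model output to be audited
            "secure_example": entry["secure_example"], # secure reference implementation
        })

    # Persist generated/reference pairs for manual or tool-assisted
    # (e.g., static-analysis) security review.
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)

Persisting the generated/reference pairs keeps the comparison step decoupled from generation, so the same outputs can be re-audited manually or with static-analysis tooling without re-querying the model.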