Neural Transfer Learning for Repairing Security Vulnerabilities in C Code

Cited by: 61
Authors
Chen, Zimin [1]
Kommrusch, Steve [2]
Monperrus, Martin [1]
Affiliations
[1] KTH Royal Inst Technol, S-11428 Stockholm, Sweden
[2] Colorado State Univ, Ft Collins, CO 80523 USA
Funding
Swedish Research Council; US National Science Foundation;
Keywords
Vulnerability fixing; transfer learning; seq2seq learning; NETWORKS;
DOI
10.1109/TSE.2022.3147265
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
In this paper, we address the problem of automatic repair of software vulnerabilities with deep learning. The major problem with data-driven vulnerability repair is that the few existing datasets of known confirmed vulnerabilities consist of only a few thousand examples. However, training a deep learning model often requires hundreds of thousands of examples. In this work, we leverage the intuition that the bug fixing task and the vulnerability fixing task are related and that the knowledge learned from bug fixes can be transferred to fixing vulnerabilities. In the machine learning community, this technique is called transfer learning. In this paper, we propose an approach for repairing security vulnerabilities named VRepair which is based on transfer learning. VRepair is first trained on a large bug fix corpus and is then tuned on a vulnerability fix dataset, which is an order of magnitude smaller. In our experiments, we show that a model trained only on a bug fix corpus can already fix some vulnerabilities. Then, we demonstrate that transfer learning improves the ability to repair vulnerable C functions. We also show that the transfer learning model performs better than a model trained with a denoising task and fine-tuned on the vulnerability fixing task. To sum up, this paper shows that transfer learning works well for repairing security vulnerabilities in C compared to learning on a small dataset.
Pages: 147-165
Number of pages: 19
Related papers
81 in total
[1]  
Adams O, 2017, 15TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2017), VOL 1: LONG PAPERS, P937
[2]  
Ahmad WU, 2021, 2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), P2655
[3]   Compilation Error Repair: For the Student Programs, From the Student Programs [J].
Ahmed, Umair Z. ;
Kumar, Pawan ;
Karkare, Amey ;
Kar, Purushottam ;
Gulwani, Sumit .
2018 IEEE/ACM 40TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING: SOFTWARE ENGINEERING EDUCATION AND TRAINING (ICSE-SEET), 2018, :78-87
[4]   The Adverse Effects of Code Duplication in Machine Learning Models of Code [J].
Allamanis, Miltiadis .
PROCEEDINGS OF THE 2019 ACM SIGPLAN INTERNATIONAL SYMPOSIUM ON NEW IDEAS, NEW PARADIGMS, AND REFLECTIONS ON PROGRAMMING AND SOFTWARE (ONWARD! '19), 2019, :143-153
[5]   A Survey of Machine Learning for Big Code and Naturalness [J].
Allamanis, Miltiadis ;
Barr, Earl T. ;
Devanbu, Premkumar ;
Sutton, Charles .
ACM COMPUTING SURVEYS, 2018, 51 (04)
[6]  
Alon U., 2018, code2seq: Generating sequences from structured representations of code
[7]   The Mayhem Cyber Reasoning System [J].
Avgerinos, Thanassis ;
Brumley, David ;
Davis, John ;
Goulden, Ryan ;
Nighswander, Tyler ;
Rebert, Alex ;
Williamson, Ned .
IEEE SECURITY & PRIVACY, 2018, 16 (02) :52-60
[8]   A Software-Repair Robot Based on Continual Learning [J].
Baudry, Benoit ;
Chen, Zimin ;
Etemadi, Khashayar ;
Fu, Han ;
Ginelli, Davide ;
Kommrusch, Steve ;
Martinez, Matias ;
Monperrus, Martin ;
Ron Arteaga, Javier ;
Ye, He ;
Yu, Zhongxing .
IEEE SOFTWARE, 2021, 38 (04) :28-35
[9]   CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software [J].
Bhandari, Guru ;
Naseer, Amara ;
Moonen, Leon .
PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON PREDICTIVE MODELS AND DATA ANALYTICS IN SOFTWARE ENGINEERING (PROMISE '21), 2021, :30-39
[10]  
Bojar O., 2014, P 9 WORKSHOP STAT MA, P12