Neural Transfer Learning for Repairing Security Vulnerabilities in C Code

Cited by: 41
Authors
Chen, Zimin [1]
Kommrusch, Steve [2]
Monperrus, Martin [1]
Affiliations
[1] KTH Royal Inst Technol, S-11428 Stockholm, Sweden
[2] Colorado State Univ, Ft Collins, CO 80523 USA
Funding
US National Science Foundation; Swedish Research Council
Keywords
Vulnerability fixing; transfer learning; seq2seq learning; NETWORKS;
DOI
10.1109/TSE.2022.3147265
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
In this paper, we address the problem of automatic repair of software vulnerabilities with deep learning. The major problem with data-driven vulnerability repair is that the few existing datasets of known confirmed vulnerabilities consist of only a few thousand examples. However, training a deep learning model often requires hundreds of thousands of examples. In this work, we leverage the intuition that the bug fixing task and the vulnerability fixing task are related and that the knowledge learned from bug fixes can be transferred to fixing vulnerabilities. In the machine learning community, this technique is called transfer learning. In this paper, we propose an approach for repairing security vulnerabilities named VRepair which is based on transfer learning. VRepair is first trained on a large bug fix corpus and is then tuned on a vulnerability fix dataset, which is an order of magnitude smaller. In our experiments, we show that a model trained only on a bug fix corpus can already fix some vulnerabilities. Then, we demonstrate that transfer learning improves the ability to repair vulnerable C functions. We also show that the transfer learning model performs better than a model trained with a denoising task and fine-tuned on the vulnerability fixing task. To sum up, this paper shows that transfer learning works well for repairing security vulnerabilities in C compared to learning on a small dataset.
Pages: 147-165
Number of pages: 19
Related papers
81 entries in total
  • [1] Adams O, 2017, 15TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2017), VOL 1: LONG PAPERS, P937
  • [2] Ahmad WU, 2021, 2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), P2655
  • [3] Compilation Error Repair: For the Student Programs, From the Student Programs
    Ahmed, Umair Z.
    Kumar, Pawan
    Karkare, Amey
    Kar, Purushottam
    Gulwani, Sumit
    [J]. 2018 IEEE/ACM 40TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING: SOFTWARE ENGINEERING EDUCATION AND TRAINING (ICSE-SEET), 2018, : 78 - 87
  • [4] The Adverse Effects of Code Duplication in Machine Learning Models of Code
    Allamanis, Miltiadis
    [J]. PROCEEDINGS OF THE 2019 ACM SIGPLAN INTERNATIONAL SYMPOSIUM ON NEW IDEAS, NEW PARADIGMS, AND REFLECTIONS ON PROGRAMMING AND SOFTWARE (ONWARD!' 19), 2019, : 143 - 153
  • [5] A Survey of Machine Learning for Big Code and Naturalness
    Allamanis, Miltiadis
    Barr, Earl T.
    Devanbu, Premkumar
    Sutton, Charles
    [J]. ACM COMPUTING SURVEYS, 2018, 51 (04)
  • [6] Alon U., 2018, ARXIV
  • [7] The Mayhem Cyber Reasoning System
    Avgerinos, Thanassis
    Brumley, David
    Davis, John
    Goulden, Ryan
    Nighswander, Tyler
    Rebert, Alex
    Williamson, Ned
    [J]. IEEE SECURITY & PRIVACY, 2018, 16 (02) : 52 - 60
  • [8] A Software-Repair Robot Based on Continual Learning
    Baudry, Benoit
    Chen, Zimin
    Etemadi, Khashayar
    Fu, Han
    Ginelli, Davide
    Kommrusch, Steve
    Martinez, Matias
    Monperrus, Martin
    Ron Arteaga, Javier
    Ye, He
    Yu, Zhongxing
    [J]. IEEE SOFTWARE, 2021, 38 (04) : 28 - 35
  • [9] CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software
    Bhandari, Guru
    Naseer, Amara
    Moonen, Leon
    [J]. PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON PREDICTIVE MODELS AND DATA ANALYTICS IN SOFTWARE ENGINEERING (PROMISE '21), 2021, : 30 - 39
  • [10] Bojar Ondřej, 2014, P 9 WORKSHOP STAT MA, P12