Detecting code vulnerabilities by learning from large-scale open source repositories

Cited by: 6
Authors
Xu, Rongze [1]
Tang, Zhanyong [1]
Ye, Guixin [1]
Wang, Huanting [1]
Ke, Xin [1]
Fang, Dingyi [1]
Wang, Zheng [2]
Affiliations
[1] Northwest University, Xi'an, People's Republic of China
[2] University of Leeds, Leeds, England
Funding
National Natural Science Foundation of China
Keywords
Code vulnerability detection; Deep learning; Attention mechanism; Software vulnerability; LSTM
DOI
10.1016/j.jisa.2022.103293
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Machine learning methods are widely used to identify common, repeatedly occurring bugs and code vulnerabilities. The performance of a machine-learned model is bounded by the quality and quantity of its training data and by the model's ability to extract and capture the essential information of the problem domain. Unfortunately, there is a shortage of high-quality samples for training code vulnerability detection models, and existing machine learning methods are inadequate in capturing code vulnerability patterns. We present DEVELOPER, a novel learning framework for building code vulnerability detection models. To address the data scarcity challenge, DEVELOPER automatically gathers training samples from open-source projects and applies constraint rules to the collected data to filter out noise and improve sample quality. The collected data provides many real-world vulnerable code training samples to complement the samples available in standard vulnerability databases. To build an effective code vulnerability detection model, DEVELOPER employs a convolutional neural network architecture with attention mechanisms to extract a code representation from the program's abstract syntax tree. The extracted program representation is then fed to a downstream network - a bidirectional long short-term memory architecture - to predict whether the target code contains a vulnerability. We apply DEVELOPER to identify vulnerabilities at the program source-code level. Our evaluation shows that DEVELOPER outperforms state-of-the-art methods by uncovering more vulnerabilities with a lower false-positive rate.
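The data-collection step described in the abstract (gather candidate samples from open-source projects, then apply constraint rules to discard noisy ones) can be illustrated with a minimal sketch. The abstract does not state the actual rules, so everything below is an assumption made for illustration: the `Commit` record, the keyword pattern, the file-suffix list, and the `MAX_CHANGED_LINES` threshold are hypothetical and are not taken from DEVELOPER.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical constraint rules; the abstract does not specify the real ones.
SECURITY_KEYWORDS = re.compile(
    r"\b(cve-\d{4}-\d+|buffer overflow|use[- ]after[- ]free|"
    r"out[- ]of[- ]bounds|vulnerab\w*)\b",
    re.IGNORECASE,
)
SOURCE_SUFFIXES = (".c", ".h", ".cpp", ".cc")  # assumed target languages
MAX_CHANGED_LINES = 50  # assumed cap: very large diffs are treated as noisy


@dataclass
class Commit:
    """Illustrative record for one commit mined from a repository."""
    message: str
    changed_files: List[str]
    changed_lines: int


def passes_constraints(commit: Commit) -> bool:
    """Return True if the commit satisfies every illustrative rule."""
    # Rule 1: the commit message must mention a security-related term.
    if not SECURITY_KEYWORDS.search(commit.message):
        return False
    # Rule 2: at least one changed file must be source code.
    if not any(f.endswith(SOURCE_SUFFIXES) for f in commit.changed_files):
        return False
    # Rule 3: the change must be small enough to localize the fix.
    return commit.changed_lines <= MAX_CHANGED_LINES


def filter_commits(commits: List[Commit]) -> List[Commit]:
    """Keep only the commits that pass all constraint rules."""
    return [c for c in commits if passes_constraints(c)]


candidates = [
    Commit("Fix CVE-2020-1234: buffer overflow in parser", ["src/parse.c"], 12),
    Commit("Update README", ["README.md"], 3),
    Commit("Fix use-after-free", ["docs/notes.txt"], 5),
    Commit("Fix vulnerability by rewriting module", ["src/a.c"], 900),
]
kept = filter_commits(candidates)
```

In this sketch only the first candidate survives: the second has no security term, the third touches no source file, and the fourth exceeds the size cap. A real pipeline would likely need richer rules than keyword matching, which is known to admit both false positives and false negatives.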
Pages: 14
Related papers
58 in total
[1]   code2vec: Learning Distributed Representations of Code [J].
Alon, Uri ;
Zilberstein, Meital ;
Levy, Omer ;
Yahav, Eran .
PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, 2019, 3 (POPL)
[2]  
[Anonymous], GITHUB GITHUB
[3]   On the Impact of Programming Languages on Code Quality: A Reproduction Study [J].
Berger, Emery D. ;
Hollenbeck, Celeste ;
Maj, Petr ;
Vitek, Olga ;
Vitek, Jan .
ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS, 2019, 41 (04)
[4]  
Büch L, 2019, 2019 IEEE 26TH INTERNATIONAL CONFERENCE ON SOFTWARE ANALYSIS, EVOLUTION AND REENGINEERING (SANER), P95, DOI 10.1109/SANER.2019.8668039
[5]   Compiler Fuzzing through Deep Learning [J].
Cummins, Chris ;
Petoumenos, Pavlos ;
Murray, Alastair ;
Leather, Hugh .
ISSTA'18: PROCEEDINGS OF THE 27TH ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON SOFTWARE TESTING AND ANALYSIS, 2018, :95-105
[6]   LEOPARD: Identifying Vulnerable Code for Vulnerability Assessment Through Program Metrics [J].
Du, Xiaoning ;
Chen, Bihuan ;
Li, Yuekang ;
Guo, Jianmin ;
Zhou, Yaqin ;
Liu, Yang ;
Jiang, Yu .
2019 IEEE/ACM 41ST INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2019), 2019, :60-71
[7]  
Engineering H.S.S. (HSSEDI) D.I., CWE
[8]  
Findbugs, 1995, US
[9]   Software Vulnerability Analysis and Discovery Using Machine-Learning and Data-Mining Techniques: A Survey [J].
Ghaffarian, Seyed Mohammad ;
Shahriari, Hamid Reza .
ACM COMPUTING SURVEYS, 2017, 50 (04)
[10]  
GitHub I. b, GITHUB DOCS