An Empirical Study of High-Impact Factors for Machine Learning-Based Vulnerability Detection

Cited by: 0
Authors
Zheng, Wei [1 ]
Gao, Jialiang [1 ]
Wu, Xiaoxue [2 ]
Xun, Yuxing [1 ]
Liu, Guoliang [1 ]
Chen, Xiang [3 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Software, Xian, Peoples R China
[2] Northwestern Polytech Univ, Sch Cyberspace Secur, Xian, Peoples R China
[3] Nantong Univ, Sch Comp Sci & Technol, Nantong, Peoples R China
Source
PROCEEDINGS OF THE 2020 IEEE 2ND INTERNATIONAL WORKSHOP ON INTELLIGENT BUG FIXING (IBF '20) | 2020
Keywords
Vulnerability Detection; Machine Learning; Comparative Study; Deep Learning; Feature Extraction; MALWARE
DOI
10.1109/ibf50092.2020.9034888
CLC Number (Chinese Library Classification)
TP31 [Computer Software]
Discipline Code
081202; 0835
Abstract
Vulnerability detection is an important topic in software engineering. To improve the effectiveness and efficiency of vulnerability detection, many traditional machine learning-based and deep learning-based detection methods have been proposed. However, the impact of different factors on vulnerability detection remains unclear. For example, classification models and vectorization methods can directly affect the detection results, and code replacement can affect the features used for vulnerability detection. We conduct a comparative study to evaluate the impact of different classification algorithms, vectorization methods, and the replacement of user-defined variable and function names. In this paper, we collect three vulnerability code datasets. These datasets cover different types of vulnerabilities and contain different proportions of source code. In addition, we extract and analyze the features of these datasets to explain some of the experimental results. Our findings can be summarized as follows: (i) deep learning outperforms traditional machine learning, and BLSTM achieves the best performance; (ii) CountVectorizer can improve the performance of traditional machine learning; (iii) different vulnerability types and different code sources produce different features; we use the Random Forest algorithm to derive the features of the vulnerability code datasets, and these features include system-related functions, syntax keywords, and user-defined names; (iv) datasets without user-defined variable and function name replacement achieve better vulnerability detection results.
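The abstract outlines a pipeline of token-based vectorization, classification, and Random Forest feature analysis. The sketch below is not the authors' released code; it is a minimal Python illustration, assuming scikit-learn, of how CountVectorizer token counts can feed a Random Forest classifier whose feature importances surface the kinds of tokens the study reports (system-related functions, syntax keywords, user-defined names). The code samples and labels are hypothetical placeholders.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: each sample is raw source code text,
# labeled 1 (vulnerable) or 0 (non-vulnerable).
code_samples = [
    "char buf[8]; strcpy(buf, user_input);",
    "char buf[8]; strncpy(buf, user_input, sizeof(buf) - 1);",
    "gets(line);",
    "fgets(line, sizeof(line), stdin);",
]
labels = [1, 0, 1, 0]

# CountVectorizer turns code into token-count vectors; this token pattern keeps
# identifiers (API calls, syntax keywords, user-defined names) intact.
vectorizer = CountVectorizer(token_pattern=r"[A-Za-z_]\w*")
X = vectorizer.fit_transform(code_samples)

# Random Forest as a representative traditional machine learning classifier.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, labels)

# Rank tokens by feature importance, analogous to how the study characterizes
# the features of each vulnerability code dataset.
tokens = vectorizer.get_feature_names_out()
ranked = sorted(zip(clf.feature_importances_, tokens), reverse=True)
for score, token in ranked[:10]:
    print(f"{token}: {score:.3f}")

In this sketch the highest-ranked tokens would be library calls such as strcpy or gets, which matches the paper's observation that system-related functions and user-defined names dominate the learned features when names are not replaced.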
Pages: 26-34
Number of pages: 9