CDNM: Clustering-Based Data Normalization Method For Automated Vulnerability Detection

Cited by: 0
Authors
Wu, Tongshuai [1, 2]
Chen, Liwei [1, 2]
Du, Gewangzi [1, 2]
Zhu, Chenguang [1, 2]
Cui, Ningning [1, 2]
Shi, Gang [1, 2]
Affiliations
[1] Chinese Academy of Sciences, Institute of Information Engineering, Beijing, People's Republic of China
[2] University of Chinese Academy of Sciences, School of Cyber Security, Beijing, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Data Normalization; Clustering; Vulnerability Detection; Deep Learning
DOI
10.1093/comjnl/bxad080
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The key steps in a deep learning vulnerability detection framework are pre-processing source code and learning vulnerability features. Traditional source code representation techniques fully normalize user-defined symbols and thereby discard the semantic information associated with vulnerabilities. The current mainstream model for learning vulnerability features is the Recurrent Neural Network (RNN), whose sequential structure limits its ability to capture long-range information. This paper proposes a new vulnerability detection framework to address both problems. In the source code pre-processing phase, we propose a new data normalization method: user-defined symbols are clustered with the unsupervised K-means algorithm, and each symbol is normalized according to the cluster it falls into. This preserves the primary semantic information in the source code while keeping the sample data smooth. In the feature extraction stage, we convert the normalized source code to a text representation and feed it into Bidirectional Encoder Representations from Transformers (BERT) for automatic feature learning, which improves both semantic information extraction and long-range information capture. Experimental results show that, on real-world data collected by ourselves, the vulnerability detection precision of this method is 18.3% higher than that of the current mainstream vulnerability detection framework. Further, our method improves on the precision of the state-of-the-art method by 4.2%.
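The clustering-based normalization described in the abstract can be illustrated with a minimal sketch. The abstract does not say which features the authors cluster on or how placeholders are named, so everything specific here is an assumption made for illustration: the hashed character-bigram embedding (`embed`), the placeholder scheme `SYM_C<cluster>_<index>`, the small keyword and library-call sets, and the plain Lloyd's K-means with a deterministic start. It is not the paper's implementation.

```python
import re
from collections import Counter

# Assumed (incomplete) sets of tokens that are NOT user-defined symbols.
C_KEYWORDS = {"int", "char", "return", "if", "else", "for", "while", "void", "sizeof"}
LIBRARY_CALLS = {"strcpy", "memcpy", "malloc", "free", "strlen", "printf"}
IDENT = r"[A-Za-z_]\w*"


def embed(name, dim=16):
    """Hypothetical embedding: hashed character-bigram counts, L2-normalized."""
    vec = [0.0] * dim
    padded = f"^{name.lower()}$"
    for i in range(len(padded) - 1):
        bigram = padded[i:i + 2]
        # Deterministic hash (the built-in hash() is salted per process).
        idx = sum(ord(ch) * (31 ** k) for k, ch in enumerate(bigram)) % dim
        vec[idx] += 1.0
    norm = sum(x * x for x in vec) ** 0.5 or 1.0
    return [x / norm for x in vec]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def normalize(code, k=2):
    """Replace each user-defined symbol with a cluster-tagged placeholder."""
    tokens = re.findall(IDENT, code)
    symbols = sorted({t for t in tokens if t not in C_KEYWORDS and t not in LIBRARY_CALLS})
    k = min(k, len(symbols))
    if k == 0:
        return code, {}
    labels = kmeans([embed(s) for s in symbols], k)
    per_cluster = Counter()
    mapping = {}
    for sym, lab in zip(symbols, labels):
        mapping[sym] = f"SYM_C{lab}_{per_cluster[lab]}"
        per_cluster[lab] += 1
    # Single pass, so placeholders are never rewritten a second time.
    normalized = re.sub(IDENT, lambda m: mapping.get(m.group(0), m.group(0)), code)
    return normalized, mapping


if __name__ == "__main__":
    sample = ("void copy_buf(char *dst_buf, char *src_buf) "
              "{ int len = strlen(src_buf); strcpy(dst_buf, src_buf); }")
    normalized, mapping = normalize(sample, k=2)
    print(mapping)
    print(normalized)
```

Unlike full normalization, which would map every symbol to one generic name such as `VAR`, the cluster tag keeps a coarse trace of what kind of symbol was replaced. The sketch matches identifiers with a regular expression, so it would also rewrite names inside string literals and comments; a real pipeline would work on a proper token stream.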
Pages: 1538-1549
Number of pages: 12