Vulnerability Detection via Multiple-Graph-Based Code Representation

Cited by: 3
Authors
Qiu, Fangcheng [1]
Liu, Zhongxin [1]
Hu, Xing [2]
Xia, Xin [3]
Chen, Gang [4]
Wang, Xinyu [4]
Affiliations
[1] Zhejiang Univ, State Key Lab Blockchain & Data Secur, Hangzhou 310027, Zhejiang, Peoples R China
[2] Zhejiang Univ, Sch Software Technol, Ningbo 315103, Zhejiang, Peoples R China
[3] Huawei, Software Engn Applicat Technol Lab, Hangzhou 310051, Zhejiang, Peoples R China
[4] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Zhejiang, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Codes; Source coding; Graph neural networks; Software; Feature extraction; Deep learning; Vulnerability detection; deep learning; code representation; graph neural network;
DOI
10.1109/TSE.2024.3427815
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
During software development and maintenance, vulnerability detection is an essential part of software quality assurance. Although many program-analysis-based and machine-learning-based approaches have been proposed to detect vulnerabilities automatically, they rely on explicit rules or patterns defined by security experts and suffer from either high false positives or high false negatives. Recently, an increasing number of studies have leveraged deep learning techniques, especially Graph Neural Networks (GNNs), to detect vulnerabilities. These approaches use program analysis to represent program semantics as graphs and perform graph analysis to detect vulnerabilities. However, they suffer from two main problems: (i) existing GNN-based techniques do not effectively learn the structural and semantic features of source code for vulnerability detection; (ii) these approaches tend to ignore fine-grained information in source code. To tackle these problems, in this paper we propose a novel vulnerability detection approach, named MGVD (Multiple-Graph-Based Vulnerability Detection), to detect vulnerable functions. To effectively learn the structural and semantic features of source code, MGVD represents each function in three different forms, i.e., two statement graphs and a sequence of tokens. We then encode these representations into a three-channel feature matrix, which contains both the structural and the semantic features of the function. We also add a weight allocation layer to distribute the weights between structural and semantic features. To overcome the second problem, MGVD constructs each graph representation of the input function from multiple graphs instead of a single graph. Each graph focuses on one statement in the function, and its nodes denote the related statements and their fine-grained code elements. Finally, MGVD leverages a CNN to identify whether the function is vulnerable based on this feature matrix. We conduct experiments on 3 vulnerability datasets with a total of 30,341 vulnerable functions and 127,931 non-vulnerable functions. The experimental results show that our method outperforms the state-of-the-art by 9.68% to 10.28% in terms of F1-score.
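The three-channel feature matrix and the weight allocation step described in the abstract can be illustrated with a small sketch. The abstract does not specify how the weights are computed, so everything here is an assumption made for illustration: one scalar score per channel, a softmax to turn scores into weights, and the hypothetical names `softmax` and `fuse_channels`. It is not the authors' implementation, and the toy matrices are made-up values.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channels(channels, channel_scores):
    """Scale each channel by its weight and stack the results.

    channels: list of equally sized matrices (lists of rows), e.g. two
        statement-graph encodings and one token-sequence encoding.
    channel_scores: one raw score per channel (assumed learnable in a
        real model; fixed numbers in this sketch).
    Returns (stacked weighted channels, weights).
    """
    if len(channels) != len(channel_scores):
        raise ValueError("need exactly one score per channel")
    rows, cols = len(channels[0]), len(channels[0][0])
    for channel in channels:
        if len(channel) != rows or any(len(row) != cols for row in channel):
            raise ValueError("all channels must have the same shape")
    weights = softmax(channel_scores)
    fused = [
        [[weight * value for value in row] for row in channel]
        for weight, channel in zip(weights, channels)
    ]
    return fused, weights


# Toy 2x2 encodings standing in for the three representations.
graph_a = [[1.0, 0.0], [0.5, 0.5]]
graph_b = [[0.0, 1.0], [0.2, 0.8]]
tokens = [[0.3, 0.3], [0.1, 0.9]]

# Equal scores give each channel the same weight.
fused, weights = fuse_channels([graph_a, graph_b, tokens], [0.0, 0.0, 0.0])
```

In a setup like this, the stacked result would be the input to a CNN classifier; the classifier itself is omitted because the abstract gives no architectural details.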
Pages: 2178-2199
Number of pages: 22