Machine Learning Approaches to Code Similarity Measurement: A Systematic Review

Cited by: 3
Authors
Zhang, Zixian [1]
Saber, Takfarinas [2]
Affiliations
[1] Univ Galway, Sch Comp Sci, CRT AI, Galway H91TK33, Ireland
[2] Univ Galway, Sch Comp Sci, Lero, Galway H91TK33, Ireland
Funding
Science Foundation Ireland
Keywords
Codes; Cloning; Systematic literature review; Syntactics; Plagiarism; Semantics; Machine learning; Unsupervised learning; Source coding; Machine learning algorithms; Code similarity; Code clone; Deep learning; Graph edit distance; Clone detection; Detection framework; Semantic code
DOI
10.1109/ACCESS.2025.3553392
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Source code similarity measurement, which assesses how similar or different code segments are, plays a crucial role throughout the software development cycle, including but not limited to code quality assurance, code review, code plagiarism detection, security, and vulnerability analysis. Despite the increasing application of machine learning (ML) techniques in this domain, a comprehensive synthesis of existing methodologies is still lacking. This paper presents a systematic review of ML techniques applied to code similarity measurement, aiming to clarify current methodologies and offer insights to the research community. Following a rigorous systematic review protocol, we identified 84 primary studies and analyzed them along several dimensions: application type, ML algorithms, code representations, datasets, performance metrics, and performance evaluations. The analysis shows that 15 applications of code similarity measurement have used 51 different ML algorithms. The most prevalent code representation is the abstract syntax tree (AST), and the most frequently used dataset across code similarity applications is BigCloneBench. Beyond synthesizing existing research, the paper identifies prevailing limitations and challenges and points to potential avenues for future work.
Pages: 51729-51764
Page count: 36