Machine Learning Approaches to Code Similarity Measurement: A Systematic Review

Cited by: 3
Authors
Zhang, Zixian [1]
Saber, Takfarinas [2]
Affiliations
[1] Univ Galway, Sch Comp Sci, CRT AI, Galway H91TK33, Ireland
[2] Univ Galway, Sch Comp Sci, Lero, Galway H91TK33, Ireland
Funding
Science Foundation Ireland
Keywords
Codes; Cloning; Systematic literature review; Syntactics; Plagiarism; Semantics; Machine learning; Unsupervised learning; Source coding; Machine learning algorithms; Code similarity; Code clone; Deep learning; Graph edit distance; Clone detection; Detection framework; Semantic code
DOI
10.1109/ACCESS.2025.3553392
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Source code similarity measurement, which assesses how alike or different code segments are, plays a crucial role throughout the software development cycle, including code quality assurance, code review, plagiarism detection, security, and vulnerability analysis. Despite the increasing application of machine learning (ML) techniques in this domain, a comprehensive synthesis of existing methodologies is still lacking. This paper presents a systematic review of ML techniques applied to code similarity measurement, aiming to clarify current methodologies and offer insights to the research community. Following a rigorous systematic review protocol, we identified 84 primary studies and analyzed them along several dimensions: application type, ML algorithms devised, code representations used, datasets, performance metrics, and performance evaluations. The analysis reveals that 15 applications of code similarity measurement have used 51 different ML algorithms. The most prevalent code representation is the abstract syntax tree (AST), and the most frequently used dataset across applications is BigCloneBench. Beyond synthesizing existing research, the paper identifies prevailing limitations and challenges and points to potential avenues for future work.
Pages: 51729-51764
Page count: 36