Machine Learning Approaches to Code Similarity Measurement: A Systematic Review

Cited by: 3

Authors:

Zhang, Zixian [1]

Saber, Takfarinas [2]

Affiliations:

[1] Univ Galway, Sch Comp Sci, CRT AI, Galway H91TK33, Ireland

[2] Univ Galway, Sch Comp Sci, Lero, Galway H91TK33, Ireland

Source:

IEEE ACCESS | 2025 / Vol. 13

Funding:

Science Foundation Ireland;

Keywords:

Codes; Cloning; Systematic literature review; Syntactics; Plagiarism; Semantics; Machine learning; Unsupervised learning; Source coding; Machine learning algorithms; Code similarity; code clone; machine learning; deep learning; systematic literature review; GRAPH EDIT DISTANCE; CLONE DETECTION; DETECTION FRAMEWORK; SEMANTIC CODE;

DOI:

10.1109/ACCESS.2025.3553392

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

Source code similarity measurement, which involves assessing the degree of difference between code segments, plays a crucial role in various aspects of the software development cycle. These include, but are not limited to, code quality assurance, code review, code plagiarism detection, security, and vulnerability analysis. Despite the increasing application of machine learning (ML) techniques in this domain, a comprehensive synthesis of existing methodologies remains lacking. This paper presents a systematic review of ML techniques applied to code similarity measurement, aiming to illuminate current methodologies and contribute valuable insights to the research community. Following a rigorous systematic review protocol, we identified and analyzed 84 primary studies along a broad spectrum of dimensions: application type, ML algorithms devised, code representations used, datasets, performance metrics, and performance evaluations. A deep investigation reveals that 15 applications of code similarity measurement have utilized 51 different ML algorithms. Additionally, the most prevalent code representation is the abstract syntax tree (AST), and the most frequently employed dataset across code similarity research applications is BigCloneBench. Through this comprehensive analysis, the paper not only synthesizes existing research but also identifies prevailing limitations and challenges, shedding light on potential avenues for future work.


Pages: 51729-51764

Page count: 36

