Detecting Semantic Code Clones by Building AST-based Markov Chains Model

被引:14
作者
Wu, Yueming [1 ,2 ,3 ]
Feng, Siyue [1 ,2 ,3 ]
Zou, Deqing [1 ,2 ,3 ]
Jin, Hai [1 ,3 ,4 ]
机构
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] HUST, Sch Cyber Sci & Engn, Hubei Engn Res Ctr Big Data Secur, Wuhan 430074, Peoples R China
[3] HUST, Serv Comp Technol & Syst Lab, Natl Engn Res Ctr Big Data Technol & Syst, Wuhan 430074, Peoples R China
[4] HUST, Sch Comp Sci & Technol, Cluster & Grid Comp Lab, Wuhan 430074, Peoples R China
来源
PROCEEDINGS OF THE 37TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE 2022 | 2022年
基金
美国国家科学基金会;
关键词
Semantic Code Clones; Abstract Syntax Tree; Markov Chain; NEURAL-NETWORK;
D O I
10.1145/3551349.3560426
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Code clone detection aims to find functionally similar code fragments, which is becoming more and more important in the field of software engineering. Many code clone detection methods have been proposed, among which tree-based methods are able to handle semantic code clones. However, these methods are difficult to scale to big code due to the complexity of tree structures. In this paper, we design Amain, a scalable tree-based semantic code clone detector by building Markov chains models. Specifically, we propose a novel method to transform the original complex tree into simple Markov chains and measure the distance of all states in these chains. After obtaining all distance values, we feed them into a machine learning classifier to train a code clone detector. To examine the effectiveness of Amain, we evaluate it on two widely used datasets namely Google Code Jam and BigCloneBench. Experimental results show that Amain is superior to nine state-of-the-art code clone detection tools (i.e., SourcererCC, RtvNN, Deckard, ASTNN, TBCNN, CDLH, FCCA, DeepSim, and SCDetector).
引用
收藏
页数:13
相关论文
共 50 条
  • [1] [Anonymous], 2022, pycparser is a complete parser of the C language
  • [2] [Anonymous], 2022, An open source machine learning library that supports supervised and unsupervised learning
  • [3] [Anonymous], 2017, Google Code Jam
  • [4] [Anonymous], 2022, BigCloneBench
  • [5] [Anonymous], 2022, A pure Python library for working with Java source code, provies a lexer and parser targeting Java 8
  • [6] Balazinska M., 1999, Proceedings Sixth International Software Metrics Symposium (Cat. No.PR00403), P292, DOI 10.1109/METRIC.1999.809750
  • [7] Comparison and evaluation of clone detection tools
    Bellon, Stefan
    Koschke, Rainer
    Antoniol, Giuliano
    Krinke, Jens
    Merlo, Ettore
    [J]. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2007, 33 (09) : 577 - 591
  • [8] GLOBAL OPTIMIZATION OF A NEURAL NETWORK-HIDDEN MARKOV MODEL HYBRID
    BENGIO, Y
    DEMORI, R
    FLAMMIA, G
    KOMPE, R
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS, 1992, 3 (02): : 252 - 259
  • [9] Haskell Clone Detection using Pattern Comparing Algorithm
    Chodarev, Sergej
    Pietrikova, Emilia
    Kollar, Jan
    [J]. 2015 13TH INTERNATIONAL CONFERENCE ON ENGINEERING OF MODERN ELECTRIC SYSTEMS (EMES), 2015,
  • [10] Ducasse S., 1999, Proceedings IEEE International Conference on Software Maintenance - 1999 (ICSM'99). `Software Maintenance for Business Change' (Cat. No.99CB36360), P109, DOI 10.1109/ICSM.1999.792593