Code2Img: Tree-Based Image Transformation for Scalable Code Clone Detection

Cited by: 4
Authors
Hu, Yutao [1]
Fang, Yilin [1]
Sun, Yifan [1]
Jia, Yaru [1]
Wu, Yueming [2]
Zou, Deqing [1]
Jin, Hai [3]
Affiliations
[1] Huazhong Univ Sci & Technol, Hubei Engn Res Ctr Big Data Secur, Sch Cyber Sci & Engn, Natl Engn Res Ctr Big Data Technol & Syst,Serv Com, Wuhan 430074, Peoples R China
[2] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
[3] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Cluster & Grid Comp Lab, Natl Engn Res Ctr Big Data Technol & Syst,Serv Com, Wuhan 430074, Peoples R China
Funding
National Science Foundation (United States);
Keywords
Code clone; clone detection; scalability; NEURAL-NETWORK; SEARCH; MODEL;
DOI
10.1109/TSE.2023.3295801
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Code clone detection is an active research area in software engineering. Clone detection has two core demands: scalability and the ability to detect complicated clones. For scalability, existing approaches treat source code as text or as a token sequence and compute similarity over that representation. However, text-based and token-based approaches struggle to detect complicated clone types because they do not consider code structure. Methods based on intermediate representations of code can detect complicated clone types effectively, but the complexity of those representations limits their scalability. In this paper, we propose Code2Img, a tree-based code clone detector that scales while detecting complicated clones effectively. Given the source code, we first filter clones with an inverted index to locate suspected clones. For each suspected clone, we create an adjacency image from the adjacency matrix of the normalized abstract syntax tree (AST). We then design an image encoder to further highlight structural details and refine the pixels of the image. Specifically, we employ a Markov model to encode the adjacency image into a state probability image and remove its useless pixels. In this way, the original complex tree is transformed into a one-dimensional vector while preserving the structural features of the AST. Finally, we detect clones by calculating the Jaccard similarity of these vectors. We conduct comparative evaluations of effectiveness and scalability against eight other state-of-the-art clone detectors (SourcererCC, NIL, LVMapper, Nicad, Siamese, CCAligner, Deckard, and Yang2018). The experimental results show that Code2Img achieves the best performance among all compared tools in terms of both detection effectiveness and scalability, indicating that Code2Img is applicable to scalable detection of complicated clones.
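The pipeline summarized in the abstract (adjacency matrix of a normalized AST, Markov-style state probabilities, Jaccard comparison) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy tuple-based AST, the function names, the row normalization used to obtain transition probabilities, and the use of weighted Jaccard on the resulting sparse vectors are all assumptions made for illustration. The inverted-index filtering step is omitted.

```python
from collections import Counter

# Hypothetical toy AST: each node is (node_type, [children]).
# Identifiers and literals are assumed to be normalized away already.

def edges(tree):
    """Yield (parent_type, child_type) for every parent-child edge."""
    node_type, children = tree
    for child in children:
        yield node_type, child[0]
        yield from edges(child)

def transition_probabilities(tree):
    """Row-normalize type-level adjacency counts (assumed Markov encoding).

    Only non-zero entries are stored, which loosely mirrors dropping
    empty pixels before flattening the image into a vector.
    """
    counts = Counter(edges(tree))
    row_totals = Counter()
    for (parent, _child), n in counts.items():
        row_totals[parent] += n
    return {pair: n / row_totals[pair[0]] for pair, n in counts.items()}

def weighted_jaccard(p, q):
    """Weighted Jaccard: sum of minima over sum of maxima (an assumption)."""
    keys = set(p) | set(q)
    num = sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)
    den = sum(max(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)
    return num / den if den else 1.0

# Two structurally different toy methods.
tree_a = ("Method", [("If", [("Call", []), ("Return", [])]), ("Return", [])])
tree_c = ("Method", [("While", [("Call", [])]), ("Return", [])])

vec_a = transition_probabilities(tree_a)
vec_c = transition_probabilities(tree_c)

print(weighted_jaccard(vec_a, vec_a))  # identical structure
print(weighted_jaccard(vec_a, vec_c))  # partially overlapping structure
```

In this sketch, two fragments would be reported as a clone pair when their similarity exceeds a chosen threshold; the threshold value is left open here because the abstract does not state one.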
Pages: 4429-4442
Page count: 14