Improving Cross-Language Code Clone Detection via Code Representation Learning and Graph Neural Networks

Cited by: 8
Authors
Mehrotra, Nikita [1]
Sharma, Akash [1]
Jindal, Anmol [1]
Purandare, Rahul [2]
Affiliations
[1] IIIT Delhi, Dept Comp Sci & Engn, Delhi 110020, India
[2] Univ Nebraska Lincoln, Lincoln, NE 68588 USA
Keywords
Codes; Cloning; Syntactics; Semantics; Java; Task analysis; Source coding; Program representation learning; cross-language code clone detection; graph-based neural networks; abstract syntax trees
DOI
10.1109/TSE.2023.3311796
Chinese Library Classification (CLC)
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Code clone detection is an important aspect of software development and maintenance. Extensive research in this domain has helped reduce the complexity and increase the robustness of source code, thereby assisting bug detection tools. However, the majority of the clone detection literature is confined to a single language. With the increasing prevalence of cross-platform applications, functionality is commonly replicated across multiple languages, resulting in code fragments that have similar functionality but belong to different languages. Since such clones are syntactically unrelated, single-language clone detection tools are not applicable to them. In this article, we propose Rubhus, a semi-supervised deep learning-based tool capable of detecting clones across different programming languages. Rubhus uses the control and data flow enriched abstract syntax trees (ASTs) of code fragments to leverage their syntactic and structural information and then applies graph neural networks (GNNs) to extract this information for the task of clone detection. We demonstrate the effectiveness of the proposed system through experiments on datasets consisting of Java, C, and Python programs and evaluate its performance in terms of precision, recall, and F1 score. Our results indicate that Rubhus outperforms the state-of-the-art cross-language clone detection tools.
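The idea of a flow-enriched AST mentioned in the abstract can be illustrated with a minimal sketch. This is not the Rubhus implementation: it assumes Python input only, uses the standard `ast` module, and approximates data flow by linking each variable read to the most recent earlier write in straight-line code. The function name `build_flow_graph` and the edge labels `child` and `data_flow` are hypothetical, and the GNN stage is omitted.

```python
import ast


def build_flow_graph(source):
    """Turn a Python snippet into (nodes, edges).

    nodes: list of AST node type names, one per graph node.
    edges: set of (src_index, dst_index, kind) tuples.
    Illustrative only; assumes straight-line code without branches.
    """
    tree = ast.parse(source)
    nodes = []   # node labels, indexed by graph node id
    edges = set()
    names = []   # (ast.Name node, graph index) pairs for the data-flow pass

    def visit(node, parent):
        idx = len(nodes)
        nodes.append(type(node).__name__)
        if parent is not None:
            # Syntactic edge taken directly from the tree structure.
            edges.add((parent, idx, "child"))
        if isinstance(node, ast.Name):
            names.append((node, idx))
        for child in ast.iter_child_nodes(node):
            visit(child, idx)

    visit(tree, None)

    # Crude data-flow approximation: on each line, handle reads before
    # writes, so that `x = x + 1` links the read to the previous write.
    def order(item):
        node, _ = item
        return (node.lineno, isinstance(node.ctx, ast.Store), node.col_offset)

    last_write = {}
    for node, idx in sorted(names, key=order):
        if isinstance(node.ctx, ast.Store):
            last_write[node.id] = idx
        elif isinstance(node.ctx, ast.Load) and node.id in last_write:
            edges.add((last_write[node.id], idx, "data_flow"))

    return nodes, edges
```

Calling `build_flow_graph("a = 1\nb = a + 2\n")` yields the tree edges plus one `data_flow` edge from the write of `a` to its later read. A graph of this shape (node labels plus typed edges) is the kind of input a graph neural network could consume.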
Pages: 4846-4868
Page count: 23