Semantic Clone Detection Based on Code Feature Fusion Learning

Cited by: 1
Authors
Zhang, Qianjin [1, 2]
Jin, Dahai [1, 2]
Wang, Yawen [2]
Gong, Yunzhan [2]
Affiliations
[1] Beijing Univ Posts & Telecommun, State Key Lab Networking & Switching Technol, Beijing 100876, Peoples R China
[2] Guangxi Key Lab Cryptog & Informat Secur, Guilin 541004, Guangxi, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Code clone detection; code representation learning; code semantic understanding; graph neural network
DOI
10.1142/S0218194023500249
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Code clones are duplicated code snippets that significantly threaten software maintenance and the public corpora used for code representation learning. Traditionally, source code is represented either by its code context or by structural information such as the abstract syntax tree (AST) and the control flow graph (CFG), and both context-based and structure-based models have contributed significantly to the development of code clone detection. In this paper, we present a hybrid embedding model for code clone detection (HEM-CCD), which fuses token sequential information with graph-based structural information. We insert the tokens' global context information, encoded by a bi-directional recurrent neural network, into the AST-based graph to obtain a comprehensive semantic representation of the code. Feeding this graph into a gated graph neural network, we then generate code semantic vectors for similarity evaluation. We have implemented our model and evaluated it on two public clone datasets (BigCloneBench and GoogleCodeJam), and the results indicate that HEM-CCD outperforms several state-of-the-art approaches.
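The pipeline outlined in the abstract (sequential token context fused into an AST-based graph, propagation over the graph, then vector similarity) can be illustrated with a small untrained sketch. This is not the authors' HEM-CCD implementation. It assumes Python's standard `ast` module as the parser, replaces the learned embeddings with hash-derived vectors, replaces the bi-directional recurrent network with forward and backward running averages, and replaces the gated graph neural network with plain neighbour averaging. The names `embed`, `sequence_context`, `code_vector`, and `cosine`, and the size `DIM`, are hypothetical.

```python
import ast
import hashlib
import math

DIM = 16  # illustrative vector size (assumption, not from the paper)

def embed(label):
    """Deterministic stand-in for a learned embedding lookup."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return [b / 127.5 - 1.0 for b in digest[:DIM]]

def token_text(node):
    """Return the token carried by a leaf-like AST node, or None."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return None

def sequence_context(vectors, decay=0.5):
    """Toy bi-directional context: forward plus backward running averages."""
    def sweep(seq):
        state, out = [0.0] * DIM, []
        for v in seq:
            state = [decay * s + (1.0 - decay) * x for s, x in zip(state, v)]
            out.append(state)
        return out
    fwd = sweep(vectors)
    bwd = sweep(vectors[::-1])[::-1]
    return [[f + b for f, b in zip(fv, bv)] for fv, bv in zip(fwd, bwd)]

def code_vector(source, rounds=3):
    """Map a Python snippet to one vector via an AST graph with fused token context."""
    # Walk the AST, giving every node occurrence its own index.
    nodes, edges, stack = [], [], [(ast.parse(source), -1)]
    while stack:
        node, parent = stack.pop()
        i = len(nodes)
        nodes.append(node)
        if parent >= 0:
            edges.append((parent, i))
        for child in ast.iter_child_nodes(node):
            if not isinstance(child, ast.expr_context):  # skip Load/Store markers
                stack.append((child, i))
    # Structural feature: one vector per AST node type.
    feats = [embed(type(n).__name__) for n in nodes]
    # Sequential feature: tokens in source order with two-sided context.
    tokens = sorted((n.lineno, n.col_offset, i, token_text(n))
                    for i, n in enumerate(nodes) if token_text(n) is not None)
    context = sequence_context([embed(text) for _, _, _, text in tokens])
    for (_, _, i, _), ctx in zip(tokens, context):
        feats[i] = [a + c for a, c in zip(feats[i], ctx)]  # fusion step
    # Undirected parent-child adjacency.
    neighbours = [[] for _ in nodes]
    for p, c in edges:
        neighbours[p].append(c)
        neighbours[c].append(p)
    # Untrained propagation: blend each node with the mean of its neighbours.
    for _ in range(rounds):
        updated = []
        for i, h in enumerate(feats):
            if neighbours[i]:
                k = len(neighbours[i])
                mean = [sum(feats[j][d] for j in neighbours[i]) / k for d in range(DIM)]
                h = [0.5 * a + 0.5 * m for a, m in zip(h, mean)]
            updated.append(h)
        feats = updated
    # Mean-pool node states into a single snippet vector.
    return [sum(h[d] for h in feats) / len(feats) for d in range(DIM)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Calling `cosine(code_vector(a), code_vector(b))` on two snippets gives a score that could be compared against a threshold. Because nothing here is trained, the scores only show how the data flows through the stages and say nothing about the accuracy reported for HEM-CCD.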
Pages: 1039-1062
Number of pages: 24