Deep Learning Code Fragments for Code Clone Detection

Cited by: 430
Authors
White, Martin [1]
Tufano, Michele [1]
Vendome, Christopher [1]
Poshyvanyk, Denys [1]
Affiliations
[1] Coll William & Mary, Dept Comp Sci, Williamsburg, VA 23185 USA
Source
2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE) | 2016
Keywords
code clone detection; machine learning; deep learning; neural networks; language models; abstract syntax trees
DOI
10.1145/2970276.2970326
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Code clone detection is an important problem for software maintenance and evolution. Many approaches consider either structure or identifiers, but none of the existing detection techniques model both sources of information. These techniques also depend on generic, handcrafted features to represent code fragments. We introduce learning-based detection techniques where everything for representing terms and fragments in source code is mined from the repository. Our code analysis supports a framework, which relies on deep learning, for automatically linking patterns mined at the lexical level with patterns mined at the syntactic level. We evaluated our novel learning-based approach for code clone detection with respect to feasibility from the point of view of software maintainers. We sampled and manually evaluated 398 file- and 480 method-level pairs across eight real-world Java systems; 93% of the file- and method-level samples were evaluated to be true positives. Among the true positives, we found pairs mapping to all four clone types. We compared our approach to a traditional structure-oriented technique and found that our learning-based approach detected clones that were either undetected or suboptimally reported by the prominent tool Deckard. Our results affirm that our learning-based approach is suitable for clone detection and a tenable technique for researchers.
Pages: 87-98
Page count: 12