Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST

Cited by: 17
Authors
Chen, Long [1]
Ye, Wei [1]
Zhang, Shikun [1]
Affiliations
[1] Peking University, Beijing, People's Republic of China
Source
CF '19 - PROCEEDINGS OF THE 16TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS | 2019
Keywords
tree-based convolution; tree-based LSTM; representation learning; big code; code semantics; AST; API; semantic clone; clone detection; code search; code summarization
DOI
10.1145/3310273.3321560
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
When deep learning meets big code, a key question is how to efficiently learn a distributed representation for source code that can capture its semantics effectively. We propose to use tree-based convolution over API-enhanced AST. To demonstrate the effectiveness of our approach, we apply it to detect semantic clones, that is, code fragments with similar semantics but dissimilar syntax. Experimental results show that our approach outperforms an existing state-of-the-art approach that uses tree-based LSTM, with an increase of 0.39 and 0.12 in F1-score on OJClone and BigCloneBench respectively. We further propose architectures that incorporate our approach for code search and code summarization.
Pages: 174-182
Page count: 9
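
Illustrative sketch (not from the paper)
The abstract names tree-based convolution over an API-enhanced AST but gives no implementation details. The following Python sketch shows only the general idea of a one-layer tree-based convolution (each window covers a node and its direct children, followed by max pooling over all nodes). It is not the authors' method: Python's `ast` module stands in for their parser, the `Call:<name>` labelling is a hypothetical way to inject API information, the sizes `DIM` and `OUT` are arbitrary, and all weights and embeddings are random rather than learned.

```python
import ast
import math
import random

DIM = 8   # assumed embedding size (hypothetical)
OUT = 4   # assumed number of convolution filters (hypothetical)


def node_label(node):
    """Label a node by its type; call nodes also carry the callee name.

    The "Call:<name>" scheme is an assumption used here to mimic
    API enhancement; the paper's actual scheme may differ.
    """
    if isinstance(node, ast.Call):
        func = node.func
        name = getattr(func, "attr", None) or getattr(func, "id", None) or "?"
        return f"Call:{name}"
    return type(node).__name__


def embed(label):
    """Deterministic pseudo-random vector, a stand-in for learned embeddings."""
    rng = random.Random(label)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def make_matrix(seed):
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(OUT)]


# Random, untrained weights for the parent, leftmost and rightmost positions.
W_TOP, W_LEFT, W_RIGHT = make_matrix(1), make_matrix(2), make_matrix(3)


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def conv_window(node):
    """Convolve one window: the node plus its direct children."""
    children = list(ast.iter_child_nodes(node))
    acc = matvec(W_TOP, embed(node_label(node)))
    count = len(children)
    for i, child in enumerate(children):
        # Position-dependent mix of the left and right weight matrices.
        eta_r = i / (count - 1) if count > 1 else 0.5
        eta_l = 1.0 - eta_r
        x = embed(node_label(child))
        left, right = matvec(W_LEFT, x), matvec(W_RIGHT, x)
        acc = [a + eta_l * l + eta_r * r for a, l, r in zip(acc, left, right)]
    return [math.tanh(a) for a in acc]


def encode(source):
    """Encode a code fragment as a fixed-size vector via max pooling."""
    tree = ast.parse(source)
    features = [conv_window(n) for n in ast.walk(tree)]
    return [max(column) for column in zip(*features)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
```

In a clone-detection setting, two fragments would each be passed through `encode` and compared with `cosine` against a threshold. With untrained weights the scores carry no meaning; the sketch only shows the data flow.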