COCLUBERT: Clustering Machine Learning Source Code

Cited by: 2
Authors
Hagglund, Marcus [1]
Pena, Francisco J. [1]
Pashami, Sepideh [2]
Al-Shishtawy, Ahmad [2]
Payberah, Amir H. [1, 2]
Affiliations
[1] KTH Royal Inst Technol, Stockholm, Sweden
[2] RISE Res Inst Sweden, Stockholm, Sweden
Source
20TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2021) | 2021
Keywords
Source Code Clustering; NLP; BERT; CuBERT
DOI
10.1109/ICMLA52953.2021.00031
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Machine learning (ML) applications now appear in nearly every aspect of modern life, and more developers are engaged in the field than ever. To facilitate the development of new ML applications, it would be beneficial to provide services that enable developers to share, access, and search for source code easily. A step towards such a service is to cluster source code by functionality. In this work, we present COCLUBERT, a BERT-based model that embeds source code according to its functionality and clusters it accordingly. We build COCLUBERT on CuBERT, a variant of BERT pre-trained on source code, and present three ways to fine-tune it for the clustering task. In the experiments, we compare COCLUBERT with a baseline model that clusters source code using CuBERT embeddings without fine-tuning. We show that COCLUBERT significantly outperforms the baseline model, increasing the Dunn Index by a factor of 141, the Silhouette Score by a factor of two, and the Adjusted Rand Index by a factor of 11.
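As a rough illustration of one of the metrics named in the abstract, the sketch below computes a common form of the Dunn Index: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. It is not taken from the paper. The function name `dunn_index`, the Euclidean distance, and the toy 2-D points standing in for code embeddings are all assumptions made for this example, and the paper may use a different variant of the index.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn Index (one common variant; assumed here, not taken from the paper).

    Ratio of the smallest between-cluster point distance to the largest
    within-cluster point distance. Higher means tighter, better separated clusters.
    """
    # Group points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn Index needs at least two clusters")

    # Largest distance between two points that share a cluster (cluster diameter).
    max_diameter = max(
        (math.dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )

    # Smallest distance between two points that belong to different clusters.
    min_separation = min(
        math.dist(a, b)
        for first, second in combinations(list(clusters.values()), 2)
        for a in first
        for b in second
    )

    if max_diameter == 0.0:
        return math.inf  # every cluster is a single point or a set of duplicates
    return min_separation / max_diameter


# Hypothetical toy data: four 2-D points standing in for source code embeddings.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good_labels = [0, 0, 1, 1]  # groups nearby points together
poor_labels = [0, 1, 0, 1]  # mixes distant points in each cluster

print(dunn_index(points, good_labels))
print(dunn_index(points, poor_labels))
```

On this toy data the well-separated labelling scores far higher than the mixed one, which is the direction of improvement the abstract reports for the fine-tuned embeddings relative to the baseline.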
Pages: 151-158
Page count: 8