Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code

Cited by: 141
Authors
Karampatsis, Rafael-Michael [1]
Babii, Hlib [2]
Robbes, Romain [2]
Sutton, Charles [1,3]
Janes, Andrea [2]
Affiliations
[1] Univ Edinburgh, Edinburgh, Midlothian, Scotland
[2] Free Univ Bozen Bolzano, Bozen Bolzano, Italy
[3] Google Res, Mountain View, CA USA
Source
2020 ACM/IEEE 42ND INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2020) | 2020
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Naturalness of code; Neural Language Models; Byte-Pair Encoding;
DOI
10.1145/3377811.3380342
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported. All datasets, code, and trained models used in this work are publicly available.
Pages: 1073-1085
Page count: 13