A unified multi-task learning model for AST-level and token-level code completion

Citations: 19
Authors
Liu, Fang [1 ,2 ]
Li, Ge [1 ,2 ]
Wei, Bolin [1 ,2 ]
Xia, Xin [3 ]
Fu, Zhiyi [1 ,2 ]
Jin, Zhi [1 ,2 ]
Affiliations
[1] Peking Univ, Minist Educ, Key Lab High Confidence Software Technol, Beijing, Peoples R China
[2] Peking Univ, Sch Comp Sci, Beijing, Peoples R China
[3] Huawei, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Code completion; Deep learning; Multi-task learning;
DOI
10.1007/s10664-022-10140-7
Chinese Library Classification
TP31 [Computer Software];
Discipline Classification Codes
081202; 0835;
Abstract
Code completion, one of the most useful features in Integrated Development Environments (IDEs), can accelerate software development by suggesting the next probable tokens based on the existing code in real time. Recent studies have shown that statistical language models based on recurrent neural networks can improve the performance of code completion tools by learning from large-scale software repositories. However, most existing approaches treat code completion as a single generation task, in which the model predicts the value of the tokens or AST nodes from the contextual source code without considering syntactic constraints such as static type information. Moreover, semantic relationships in programs can span long distances, and existing recurrent neural network based language models are not sufficient to capture such long-term dependencies. In this paper, we tackle these limitations by building a unified multi-task learning based code completion model for both AST-level and token-level code completion. To model the relationships and constraints between the type and value of code elements, we adopt a multi-task learning framework that predicts the type and value of the tokens (AST nodes) simultaneously. To capture long-term dependencies in the input programs, we employ a self-attentional network as the base language model. We apply our approach to both AST-level and token-level code completion. Experimental results demonstrate the effectiveness of our model compared with state-of-the-art methods.
Pages: 38
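The core idea described in the abstract — a shared representation feeding two task-specific heads, one predicting a node's type and one its value, trained jointly — can be sketched as follows. This is a minimal illustrative sketch in numpy, not the paper's actual architecture: the shapes, vocabulary sizes, and the equal-weight sum of the two losses are assumptions, and in the paper the shared representation would come from a self-attentional (Transformer-style) language model rather than random features.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    # Mean negative log-likelihood of the target classes.
    p = softmax(logits)
    return -np.log(p[np.arange(len(targets)), targets]).mean()

# Stand-in for the shared encoder output at 4 prediction positions
# (hidden size 8); the multi-task model would produce this with its
# self-attentional base language model.
h = rng.normal(size=(4, 8))

# Two task-specific output heads over hypothetical vocabularies:
# 5 node types and 100 node values.
W_type = rng.normal(size=(8, 5))
W_value = rng.normal(size=(8, 100))

type_logits = h @ W_type      # shape (4, 5)
value_logits = h @ W_value    # shape (4, 100)

# Hypothetical gold labels for the 4 positions.
type_targets = np.array([0, 1, 2, 3])
value_targets = np.array([10, 20, 30, 40])

# Joint multi-task objective: summing the two cross-entropy losses lets
# gradients from both the type task and the value task update the
# shared representation, which is what couples the two predictions.
loss = cross_entropy(type_logits, type_targets) + \
       cross_entropy(value_logits, value_targets)
```

The equal weighting of the two losses is the simplest choice; task-weighting schemes are a common refinement in multi-task learning.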