Multilingual training for Software Engineering

Citations: 31
Authors
Ahmed, Toufique [1]
Devanbu, Premkumar [1]
Affiliations
[1] Univ Calif Davis, Davis, CA 95616 USA
Source
2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022) | 2022
Funding
National Science Foundation (USA)
Keywords
code summarization; code search; method name prediction; deep learning
DOI
10.1145/3510003.3510049
Chinese Library Classification
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Well-trained machine-learning models, which leverage large amounts of open-source software data, have now become an interesting approach to automating many software engineering tasks. Several SE tasks have been subject to this approach, with performance gradually improving over the past several years with better models and training methods. More, and more diverse, clean, labeled data is better for training; but constructing good-quality datasets is time-consuming and challenging. Ways of augmenting the volume and diversity of clean, labeled data generally have wide applicability. For some languages (e.g., Ruby) labeled data is less abundant; in others (e.g., JavaScript) the available data may be more focused on some application domains, and thus less diverse. As a way around such data bottlenecks, we present evidence suggesting that human-written code in different languages (which performs the same function) is rather similar, and in particular preserves identifier naming patterns; we further present evidence suggesting that identifiers are a very important element of training data for software engineering tasks. We leverage this rather fortuitous phenomenon to find evidence that available multilingual training data (across different languages) can be used to amplify performance. We study this for 3 different tasks: code summarization, code retrieval, and function naming. We note that this data-augmenting approach is broadly compatible with different tasks, languages, and machine-learning models.
Pages: 1443-1455
Page count: 13