Are We Building on the Rock? On the Importance of Data Preprocessing for Code Summarization

Cited by: 29
Authors
Shi, Lin [1 ,6 ,7 ]
Mu, Fangwen [1 ,7 ]
Chen, Xiao [1 ]
Wang, Song [2 ]
Wang, Junjie [1 ,6 ]
Yang, Ye [3 ]
Li, Ge [4 ]
Xia, Xin [5 ]
Wang, Qing [1 ,8 ]
Affiliations
[1] Chinese Acad Sci, Inst Software, Beijing, Peoples R China
[2] York Univ, Lassonde Sch Engn, Toronto, ON, Canada
[3] Stevens Inst Technol, Sch Syst & Enterprises, Hoboken, NJ USA
[4] Peking Univ, Key Lab High Confidence Software Technol, Beijing, Peoples R China
[5] Huawei, Software Engn Applicat Technol Lab, Beijing, Peoples R China
[6] Chinese Acad Sci, Inst Software, Lab Internet Software Technol, Beijing, Peoples R China
[7] Univ Chinese Acad Sci, Beijing, Peoples R China
[8] Chinese Acad Sci, Inst Software, Sci Technol Integrated Informat Syst Lab, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2022 | 2022年
Funding
National Science Foundation (USA);
Keywords
Code Summarization; Data Quality; Empirical Study; Generation;
DOI
10.1145/3540250.3549145
Chinese Library Classification (CLC)
TP31 [Computer Software];
Subject Classification Code
081202; 0835;
Abstract
Code summarization, the task of generating useful comments for given code, has long been of interest. Most existing code summarization models are trained and validated on widely used code-comment benchmark datasets. However, little is known about the quality of these benchmark datasets, which are built from real-world projects. Are the benchmark datasets as good as expected? To bridge this gap, we conduct a systematic study to assess and improve the quality of four benchmark datasets widely used for code summarization tasks. First, we propose an automated code-comment cleaning tool that accurately detects noisy data introduced by inappropriate data preprocessing operations in existing benchmark datasets. Then, we apply the tool to the four benchmark datasets and further assess their data quality based on the detected noise. Finally, we conduct comparative experiments to investigate the impact of noisy data on the performance of code summarization models. The results show that such preprocessing noise is widespread in all four benchmark datasets, and that removing the noisy data leads to a significant improvement in code summarization performance. We believe these findings and insights will enable a better understanding of data quality in code summarization tasks and pave the way for future research and practice.
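The abstract describes an automated tool that flags code-comment pairs corrupted by flawed preprocessing. As a rough illustration only, and not the authors' actual cleaning tool, the following Python sketch shows the kind of rule-based filters such a cleaner might apply to a single code-comment pair; every noise category, pattern, and threshold below is an assumption made for the example.

# Hypothetical sketch of heuristic noise filters for code-comment pairs.
# This is NOT the tool proposed in the paper; categories and thresholds
# are illustrative assumptions only.
import re

def first_sentence(comment: str) -> str:
    """Return the first sentence of a comment, a common summarization target."""
    match = re.search(r"(.+?[.!?])(\s|$)", comment.strip(), flags=re.S)
    return match.group(1).strip() if match else comment.strip()

def detect_noise(code: str, comment: str) -> list[str]:
    """Flag a code-comment pair with zero or more (assumed) noise categories."""
    issues = []
    summary = first_sentence(comment)

    # Empty or near-empty summaries carry no learnable signal.
    if len(summary.split()) < 3:
        issues.append("too-short comment")

    # Leftover markup usually indicates incomplete preprocessing.
    if re.search(r"</?\w+>|\{@\w+", comment):
        issues.append("embedded HTML/Javadoc tags")

    # Auto-generated boilerplate adds near-duplicate, low-value pairs.
    if re.search(r"auto[- ]generated|created by eclipse", comment, re.I):
        issues.append("auto-generated comment")

    # Commented-out code masquerading as a natural-language summary.
    if re.search(r"[;{}]\s*$", summary) or summary.count("(") > 2:
        issues.append("commented-out code")

    # Comments dominated by non-ASCII text are hard to score with BLEU-style metrics.
    non_ascii = sum(ord(ch) > 127 for ch in comment)
    if comment and non_ascii / len(comment) > 0.5:
        issues.append("mostly non-English")

    return issues

if __name__ == "__main__":
    pair = ("public int getId() { return id; }", "<p>Gets the id.</p>")
    print(detect_noise(*pair))  # e.g. ['embedded HTML/Javadoc tags']

A cleaner built this way would typically drop or repair flagged pairs before training and report per-category counts, which is broadly the kind of dataset assessment the abstract describes.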
Pages: 107-119
Number of pages: 13