The Trade-off Between Data Volume and Quality in Predicting User Satisfaction in Software Projects

Cited by: 0
Author
Radlinski, Lukasz [1 ]
Affiliation
[1] West Pomeranian Univ Technol Szczecin, Fac Comp Sci & Informat Technol, Szczecin, Poland
Source
2024 50TH EUROMICRO CONFERENCE ON SOFTWARE ENGINEERING AND ADVANCED APPLICATIONS, SEAA 2024 | 2024
Keywords
software projects; user satisfaction; prediction; data quality; data volume; machine learning; ISBSG; models
DOI
10.1109/SEAA64295.2024.00080
Chinese Library Classification (CLC)
TP39 [Applications of computers]
Discipline classification codes
081203; 0835
Abstract
Most predictive studies involving the ISBSG dataset have used only high-quality cases (according to the Data Quality Rating and UFP Rating) and a few predictors with no or very few missing values. This study investigated the trade-off between data volume and data quality when predicting user satisfaction in software projects. Specifically, it explored whether machine learning models perform better when trained on a larger dataset containing some portion of low-quality data, a smaller dataset with only high-quality data, or an intermediate setting. Standardised accuracy (SA), a "win-tie-loss" approach, and the matched-pairs rank biserial correlation coefficient were used to evaluate predictive performance. Rankings of data selection strategies for particular models were created using the Scott-Knott Effect Size Difference test, and the robustness of the results was assessed using Kendall's W. Most models achieved higher predictive accuracy when trained on a larger subset, even though it contained some low-quality data, and for most models the data selection strategies were robust to data splits. The ranks of the data selection strategies were stable across models. Hence, a practical recommendation for predicting user satisfaction, especially when a dataset is small, is to train predictive models on a relatively high-volume subset despite some low-quality data. The reported rankings may be helpful when setting up future experiments on user satisfaction with the ISBSG dataset.
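The abstract names three of the evaluation instruments: standardised accuracy (SA), a win-tie-loss tally, and Kendall's W for robustness of rankings. As a rough orientation only, and not the paper's own code, the Python sketch below implements SA in the usual Shepperd-and-MacDonell sense, a plain win-tie-loss count over paired absolute errors, and Kendall's W without tie correction. The function names, the 1,000-run Monte Carlo baseline, and the tie tolerance are illustrative assumptions; the matched-pairs rank biserial correlation and the Scott-Knott ESD test are omitted.

```python
import numpy as np

def standardised_accuracy(y_true, y_pred, n_runs=1000, seed=0):
    # SA = (1 - MAE_model / MAE_p0) * 100, where MAE_p0 is the mean MAE of
    # "random guessing", estimated here by Monte Carlo: predict each case
    # with a value drawn at random from the observed outcomes.
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae_model = np.mean(np.abs(y_true - y_pred))
    mae_p0 = np.mean([
        np.mean(np.abs(y_true - rng.choice(y_true, size=y_true.size)))
        for _ in range(n_runs)
    ])
    return (1.0 - mae_model / mae_p0) * 100.0

def win_tie_loss(abs_err_a, abs_err_b, tol=1e-9):
    # Pairwise tally over the same test cases: model A "wins" a case when
    # its absolute error is smaller than model B's by more than `tol`.
    a = np.asarray(abs_err_a, dtype=float)
    b = np.asarray(abs_err_b, dtype=float)
    wins = int(np.sum(a < b - tol))
    losses = int(np.sum(a > b + tol))
    return wins, a.size - wins - losses, losses

def kendalls_w(ranks):
    # Kendall's coefficient of concordance for an (m x k) matrix of ranks
    # (m rankings of the same k items), ignoring tie correction:
    # W = 12 * S / (m**2 * (k**3 - k)), where S is the sum of squared
    # deviations of the column rank sums from their mean; W -> 1 means
    # the m rankings agree completely.
    ranks = np.asarray(ranks, dtype=float)
    m, k = ranks.shape
    col_sums = ranks.sum(axis=0)
    s = np.sum((col_sums - col_sums.mean()) ** 2)
    return 12.0 * s / (m ** 2 * (k ** 3 - k))
```

For instance, `kendalls_w(np.array([[1, 2, 3], [1, 2, 3]]))` returns 1.0 for two identical rankings of three data selection strategies. How the paper actually parameterises the SA baseline or declares ties is not stated in this record.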
Pages: 483-490
Page count: 8