The Trade-off Between Data Volume and Quality in Predicting User Satisfaction in Software Projects

Cited by: 0
Author
Radlinski, Lukasz [1 ]
Affiliation
[1] West Pomeranian Univ Technol Szczecin, Fac Comp Sci & Informat Technol, Szczecin, Poland
Source
2024 50TH EUROMICRO CONFERENCE ON SOFTWARE ENGINEERING AND ADVANCED APPLICATIONS, SEAA 2024 | 2024
Keywords
software projects; user satisfaction; prediction; data quality; data volume; machine learning; ISBSG; models
DOI
10.1109/SEAA64295.2024.00080
Chinese Library Classification (CLC)
TP39 [Applications of computers]
Discipline classification codes
081203; 0835
Abstract
Most predictive studies involving the ISBSG dataset have used only high-quality cases (according to the Data Quality Rating and UFP Rating) and a few predictors with no or very few missing values. This study investigated the trade-off between data volume and data quality when predicting user satisfaction in software projects. Specifically, it explored whether machine learning models perform better when trained on a larger dataset containing some portion of low-quality data, a smaller dataset with only high-quality data, or an intermediate setting. Standardised accuracy (SA), a "win-tie-loss" approach, and the matched-pairs rank biserial correlation coefficient were used to evaluate predictive performance. Rankings of data selection strategies for particular models were created using the Scott-Knott Effect Size Difference test, and the robustness of the results was assessed using Kendall's W. Most models achieved higher predictive accuracy when trained on a larger subset, even though it contained some low-quality data, and for most models the data selection strategies were robust to data splits. The ranks of the data selection strategies were stable across models. Hence, a practical recommendation for predicting user satisfaction, especially when a dataset is small, is to train predictive models on a relatively high-volume subset despite some low-quality data. The reported rankings may be helpful when setting up future experiments on user satisfaction with the ISBSG dataset.
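The abstract names three of the evaluation instruments: standardised accuracy (SA), a win-tie-loss tally, and Kendall's W for robustness of rankings. As a rough orientation only, and not the paper's own code, the Python sketch below implements SA in the usual Shepperd-and-MacDonell sense, a plain win-tie-loss count over paired absolute errors, and Kendall's W without tie correction. The function names, the 1,000-run Monte Carlo baseline, and the tie tolerance are illustrative assumptions; the matched-pairs rank biserial correlation and the Scott-Knott ESD test are omitted.

```python
import numpy as np

def standardised_accuracy(y_true, y_pred, n_runs=1000, seed=0):
    # SA = (1 - MAE_model / MAE_p0) * 100, where MAE_p0 is the mean MAE of
    # "random guessing", estimated here by Monte Carlo: predict each case
    # with a value drawn at random from the observed outcomes.
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae_model = np.mean(np.abs(y_true - y_pred))
    mae_p0 = np.mean([
        np.mean(np.abs(y_true - rng.choice(y_true, size=y_true.size)))
        for _ in range(n_runs)
    ])
    return (1.0 - mae_model / mae_p0) * 100.0

def win_tie_loss(abs_err_a, abs_err_b, tol=1e-9):
    # Pairwise tally over the same test cases: model A "wins" a case when
    # its absolute error is smaller than model B's by more than `tol`.
    a = np.asarray(abs_err_a, dtype=float)
    b = np.asarray(abs_err_b, dtype=float)
    wins = int(np.sum(a < b - tol))
    losses = int(np.sum(a > b + tol))
    return wins, a.size - wins - losses, losses

def kendalls_w(ranks):
    # Kendall's coefficient of concordance for an (m x k) matrix of ranks
    # (m rankings of the same k items), ignoring tie correction:
    # W = 12 * S / (m**2 * (k**3 - k)), where S is the sum of squared
    # deviations of the column rank sums from their mean; W -> 1 means
    # the m rankings agree completely.
    ranks = np.asarray(ranks, dtype=float)
    m, k = ranks.shape
    col_sums = ranks.sum(axis=0)
    s = np.sum((col_sums - col_sums.mean()) ** 2)
    return 12.0 * s / (m ** 2 * (k ** 3 - k))
```

For instance, `kendalls_w(np.array([[1, 2, 3], [1, 2, 3]]))` returns 1.0 for two identical rankings of three data selection strategies. How the paper actually parameterises the SA baseline or declares ties is not stated in this record.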
Pages: 483-490
Page count: 8