Using Bayesian regression and EM algorithm with missing handling for software effort prediction

Cited by: 37

Authors:

Zhang, Wen [1]

Yang, Ye [2]

Wang, Qing [3]

Affiliations:

[1] Beijing Univ Chem Technol, Sch Econ & Management, Beijing 100029, Peoples R China

[2] Stevens Inst Technol, Sch Syst & Enterprises, Hoboken, NJ 07030 USA

[3] Chinese Acad Sci, Inst Software, Lab Internet Software Technol, Beijing 100190, Peoples R China

Source:

INFORMATION AND SOFTWARE TECHNOLOGY | 2015 / Volume 58

Funding:

National Natural Science Foundation of China;

Keywords:

Bayesian regression; EM algorithm; Missing imputation; Software effort prediction; PROJECT DATA SETS; COST ESTIMATION; TEXT CLASSIFICATION; MAXIMUM-LIKELIHOOD; INCOMPLETE DATA; MACHINE; MODELS;

DOI:

10.1016/j.infsof.2014.10.005

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Context: Although independent imputation techniques are comprehensively studied in software effort prediction, there are few studies on embedded methods in dealing with missing data in software effort prediction. Objective: We propose BREM (Bayesian Regression and Expectation Maximization) algorithm for software effort prediction and two embedded strategies to handle missing data. Method: The MDT (Missing Data Toleration) strategy ignores the missing data when using BREM for software effort prediction and the MDI (Missing Data Imputation) strategy uses observed data to impute missing data in an iterative manner while elaborating the predictive model. Results: Experiments on the ISBSG and CSBSG datasets demonstrate that when there are no missing values in historical dataset, BREM outperforms LR (Linear Regression), BR (Bayesian Regression), SVR (Support Vector Regression) and M5' regression tree in software effort prediction on the condition that the test set is not greater than 30% of the whole historical dataset for ISBSG dataset and 25% of the whole historical dataset for CSBSG dataset. When there are missing values in historical datasets, BREM with the MDT and MDI strategies significantly outperforms those independent imputation techniques, including MI, BMI, CMI, MINI and M5'. Moreover, the MDI strategy provides BREM with more accurate imputation for the missing values than those given by the independent missing imputation techniques on the condition that the level of missing data in training set is not larger than 10% for both ISBSG and CSBSG datasets. Conclusion: The experimental results suggest that BREM is promising in software effort prediction. When there are missing values, the MDI strategy is preferred to be embedded with BREM. (C) 2014 Elsevier B.V. All rights reserved.


Pages: 58-70

Page count: 13

