Grey Relational Analysis based k Nearest Neighbor Missing Data Imputation for Software Quality Datasets

Cited by: 10

Authors:

Huang, Jianglin [1]

Sun, Hongyi [1]

Affiliations:

[1] City Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Hong Kong, Peoples R China

Source:

2016 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY (QRS 2016) | 2016

Keywords:

kNN; imputation; empirical software engineering estimation; missing data; COST ESTIMATION; DATA SETS; SELECTION; INFORMATION;

DOI:

10.1109/QRS.2016.20

Chinese Library Classification (CLC) number:

TP31 [Computer software]

Discipline classification codes:

081202; 0835

Abstract:

Software quality estimation is important yet difficult in software engineering studies. Historical quality datasets are used to build classification models for estimating fault proneness. However, missing values in these datasets severely affect estimation ability and therefore lead to inconclusive decision-making. Among single imputation approaches, k nearest neighbor (kNN) imputation is popular in empirical studies because of its relatively high accuracy. However, researchers are still calling for an optimal parameter setting for kNN imputation. In this study, a novel grey relational analysis based incomplete-instance kNN imputation is built for software quality data. An evaluation is conducted on four quality datasets with different simulated missingness scenarios to analyze the performance of the proposed imputation. The empirical results show that the proposed approach is superior to traditional kNN imputation and mean imputation in most cases. Moreover, classification accuracy can be maintained or even improved by using this approach in classification tasks.
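The abstract does not give the formulas behind the proposed imputation, so the following is only a minimal sketch of how a grey relational analysis based kNN imputation over incomplete instances could look, not the authors' implementation. It assumes Deng's standard grey relational coefficient with distinguishing coefficient `zeta = 0.5`, min-max scaling of each attribute, similarity computed over the attributes that two instances both have observed, and the mean of the `k` most similar donors as the imputed value. The function name `gra_knn_impute` and all parameter names are hypothetical.

```python
import numpy as np


def gra_knn_impute(X, k=3, zeta=0.5):
    """Fill NaNs in X using neighbours ranked by grey relational grade.

    Assumption: every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)

    # Min-max scale each attribute on its observed values so that
    # absolute differences are comparable across attributes.
    lo, hi = np.nanmin(X, axis=0), np.nanmax(X, axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    Z = (X - lo) / span

    out = X.copy()
    for i in np.flatnonzero(~observed.all(axis=1)):
        others = np.delete(np.arange(X.shape[0]), i)
        # NaN wherever either instance lacks the attribute, so incomplete
        # instances can still serve as neighbours on shared attributes.
        delta = np.abs(Z[others] - Z[i])
        shared = ~np.isnan(delta)
        counts = shared.sum(axis=1)

        grade = np.full(others.size, -np.inf)
        if shared.any():
            d_min, d_max = np.nanmin(delta), np.nanmax(delta)
            if d_max > 0:
                # Grey relational coefficient per attribute.
                coeff = (d_min + zeta * d_max) / (delta + zeta * d_max)
            else:
                # All shared attributes are identical: maximal similarity.
                coeff = np.where(shared, 1.0, np.nan)
            usable = counts > 0
            # Grey relational grade: mean coefficient over shared attributes.
            grade[usable] = np.nansum(coeff[usable], axis=1) / counts[usable]

        for j in np.flatnonzero(~observed[i]):
            # Donors must have attribute j observed and share some attribute.
            cand = np.flatnonzero(observed[others, j] & (counts > 0))
            if cand.size == 0:
                out[i, j] = np.nanmean(X[:, j])  # fall back to mean imputation
                continue
            best = cand[np.argsort(-grade[cand], kind="stable")][:k]
            out[i, j] = X[others[best], j].mean()
    return out


if __name__ == "__main__":
    data = np.array([
        [1.0, 2.0, 3.0],
        [1.1, 2.1, np.nan],
        [5.0, 6.0, 7.0],
        [1.2, 1.9, 3.2],
    ])
    print(gra_knn_impute(data, k=2))
```

In this sketch a larger grey relational grade means a closer neighbour, which is the opposite direction from a distance metric, so donors are ranked in descending order of grade. Donor values are always taken from the original data rather than from earlier imputations, which keeps the result independent of row order.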


Pages: 86-91

Page count: 6

