Categorical missing data imputation for software cost estimation by multinomial logistic regression

Cited by: 45
Authors
Sentas, P [1]
Angelis, L [1]
Affiliations
[1] Aristotle Univ Thessaloniki, Dept Informat, Thessaloniki 54124, Greece
Keywords
software effort prediction; cost estimation; missing data; imputation; multinomial logistic regression;
DOI
10.1016/j.jss.2005.02.026
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
A common problem in software cost estimation is the manipulation of incomplete or missing data in databases used for the development of prediction models. In such cases, the most popular and simple method of handling missing data is to ignore either the projects or the attributes with missing observations. This technique causes the loss of valuable information and therefore may lead to inaccurate cost estimation models. On the other hand, there are various imputation methods used to estimate the missing values in a data set. These methods are applied mainly on numerical data and produce continuous estimates. However, it is well known that the majority of the cost data sets contain software projects with mostly categorical attributes with many missing values. It is therefore reasonable to use some estimating method producing categorical rather than continuous values. The purpose of this paper is to investigate the possibility of using such a method for estimating categorical missing values in software cost databases. Specifically, the method known as multinomial logistic regression (MLR) is suggested for imputation and is applied on projects of the ISBSG multi-organizational software database. Comparisons of MLR with other techniques for handling missing data, such as listwise deletion (LD), mean imputation (MI), expectation maximization (EM) and regression imputation (RI) under different patterns and percentages of missing data, show the high efficiency of the proposed method. © 2005 Elsevier Inc. All rights reserved.
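The idea summarized in the abstract, imputing a missing categorical attribute with a multinomial logistic regression fitted on the complete projects, can be illustrated with a minimal sketch. This is not the authors' implementation: the data are made up (not ISBSG records), the attribute names `log_size`, `team` and `platform` and the helpers `fit_mlr` and `impute_categorical` are hypothetical, the predictors are assumed to be already standardized, and plain batch gradient descent is assumed in place of whatever fitting procedure the paper used.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fit_mlr(X, y, n_classes, lr=0.5, epochs=500):
    """Fit a multinomial logistic (softmax) regression by batch gradient descent."""
    n_features = len(X[0])
    # One weight vector per class; the last entry is the intercept.
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for xi, yi in zip(X, y):
            row = list(xi) + [1.0]
            probs = softmax([sum(w * v for w, v in zip(Wk, row)) for Wk in W])
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == yi else 0.0)
                for j, v in enumerate(row):
                    grad[k][j] += err * v
        for k in range(n_classes):
            for j in range(n_features + 1):
                W[k][j] -= lr * grad[k][j] / len(X)
    return W

def predict(W, xi):
    # Return the index of the class with the highest linear score.
    row = list(xi) + [1.0]
    scores = [sum(w * v for w, v in zip(Wk, row)) for Wk in W]
    return max(range(len(W)), key=lambda k: scores[k])

def impute_categorical(rows, target, predictors):
    """Return copies of rows with None in `target` replaced by the MLR prediction."""
    complete = [r for r in rows if r[target] is not None]
    levels = sorted({r[target] for r in complete})
    X = [[r[p] for p in predictors] for r in complete]
    y = [levels.index(r[target]) for r in complete]
    W = fit_mlr(X, y, len(levels))
    out = []
    for r in rows:
        r = dict(r)  # copy so the input rows stay untouched
        if r[target] is None:
            r[target] = levels[predict(W, [r[p] for p in predictors])]
        out.append(r)
    return out

# Toy, made-up projects with standardized numeric predictors.
projects = [
    {"log_size": -1.0, "team": -0.9, "platform": "PC"},
    {"log_size": -1.1, "team": -1.0, "platform": "PC"},
    {"log_size": -0.9, "team": -1.1, "platform": "PC"},
    {"log_size": 0.0, "team": 0.1, "platform": "Midrange"},
    {"log_size": 0.1, "team": -0.1, "platform": "Midrange"},
    {"log_size": -0.1, "team": 0.0, "platform": "Midrange"},
    {"log_size": 1.0, "team": 0.9, "platform": "Mainframe"},
    {"log_size": 1.1, "team": 1.0, "platform": "Mainframe"},
    {"log_size": 0.9, "team": 1.1, "platform": "Mainframe"},
    {"log_size": -1.05, "team": -1.0, "platform": None},
    {"log_size": 1.05, "team": 1.0, "platform": None},
]
completed = impute_categorical(projects, "platform", ["log_size", "team"])
```

Because the imputed value is always one of the observed category levels, the filled-in attribute stays categorical, which is the property the abstract contrasts with methods that produce continuous estimates.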
Pages: 404-414
Page count: 11
Related papers
50 in total
  • [31] Data Imputation for Symbolic Regression with Missing Values: A Comparative Study
    Al-Helali, Baligh
    Chen, Qi
    Xue, Bing
    Zhang, Mengjie
    2020 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2020, : 2093 - 2100
  • [32] Regression-based imputation of explanatory discrete missing data
    Hernandez-Herrera, Gilma
    Navarro, Albert
    Morina, David
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2024, 53 (09) : 4363 - 4379
  • [33] A comprehensive empirical evaluation of missing value imputation in noisy software measurement data
    Van Hulse, Jason
    Khoshgoftaar, Taghi M.
    JOURNAL OF SYSTEMS AND SOFTWARE, 2008, 81 (05) : 691 - 708
  • [34] Application of Multiple Imputation Method for Missing Data Estimation
    Ser, Gazel
    GAZI UNIVERSITY JOURNAL OF SCIENCE, 2012, 25 (04): : 869 - 873
  • [35] Support vector regression-based imputation in analogy-based software development effort estimation
    Idri, Ali
    Abnane, Ibtissam
    Abran, Alain
    JOURNAL OF SOFTWARE-EVOLUTION AND PROCESS, 2018, 30 (12)
  • [36] Estimation and imputation in linear regression with missing values in both response and covariate
    Shao, Jun
    STATISTICS AND ITS INTERFACE, 2013, 6 (03) : 361 - 368
  • [37] Software cost estimation with incomplete data
    Strike, K
    El Emam, K
    Madhavji, N
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2001, 27 (10) : 890 - 908
  • [38] Grey Relational Analysis based k Nearest Neighbor Missing Data Imputation for Software Quality Datasets
    Huang, Jianglin
    Sun, Hongyi
    2016 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY (QRS 2016), 2016, : 86 - 91
  • [39] Multinomial logistic regression-based feature selection for hyperspectral data
    Pal, Mahesh
    INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2012, 14 (01): : 214 - 220
  • [40] Estimation of parameters of logistic regression with covariates missing separately or simultaneously
    Tran, Phuoc-Loc
    Le, Truong-Nhat
    Lee, Shen-Ming
    Li, Chin-Shang
    COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2023, 52 (06) : 1981 - 2009