Estimating missing data using novel correlation maximization based methods

Cited by: 16
Authors
Sefidian, Amir Masoud [1]
Daneshpour, Negin [1]
Affiliations
[1] Shahid Rajaee Teacher Training Univ, Fac Comp Engn, Tehran, Iran
Keywords
Missing values; Imputation; Correlation; Regression; FUZZY C-MEANS; K-NEAREST NEIGHBORS; VALUE IMPUTATION; GENETIC ALGORITHM; VALUES; CLASSIFICATION; REGRESSION; FRAMEWORK; SELECTION; PATTERNS
DOI
10.1016/j.asoc.2020.106249
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The accurate estimation of missing data plays a vital role in ensuring a high level of data quality. The missing values should be imputed before performing data mining, machine learning, and other data processing tasks. Ten correlation-based imputation methods are proposed in this paper. All of these methods try to maximize the correlation between a missing feature and other features. The maximization is achieved by selecting segments of data that have strong correlations. The proposed approach involves the following main steps to impute each missing instance. First, a base set is selected from complete instances. Second, data segments with strong correlations are generated using the base set and the rest of the complete instances. Finally, each missing value is imputed by applying linear models to the discovered segments of data. This study considers seven real datasets from different fields with different missing rates. The imputation quality of the proposed methods is compared to those of seven other imputation approaches in terms of three well-known evaluation criteria. The experimental results reveal that the proposed approach has better imputation performance than competing imputation techniques in most cases. (C) 2020 Elsevier B.V. All rights reserved.
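The three steps named in the abstract (select a base set, grow a strongly correlated segment, fit a linear model) can be sketched roughly as follows. This is a minimal illustration and not one of the paper's ten methods: the abstract does not specify the selection rules, so the nearest-neighbour base set, the greedy "keep a row only if the correlation does not drop" rule, the single-predictor regression, and the names `abs_corr`, `impute_one`, and `base_size` are all assumptions made here. It also assumes the incomplete instance has exactly one missing feature.

```python
import numpy as np

def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    if np.std(x) == 0 or np.std(y) == 0:
        return 0.0
    return abs(np.corrcoef(x, y)[0, 1])

def impute_one(complete, row, target, base_size=5):
    """Estimate row[target] from fully observed rows (hypothetical helper).

    complete : 2-D array of instances with no missing values
    row      : 1-D instance whose only missing entry is row[target]
    """
    obs = [j for j in range(complete.shape[1]) if j != target]

    # Step 1 (assumed rule): the base set is the `base_size` complete rows
    # closest to `row` on the observed features.
    dist = np.linalg.norm(complete[:, obs] - row[obs], axis=1)
    order = np.argsort(dist)
    segment = list(order[:base_size])

    # Pick the observed feature most correlated with the target on the base set.
    pred = max(obs, key=lambda j: abs_corr(complete[segment, j],
                                           complete[segment, target]))
    best = abs_corr(complete[segment, pred], complete[segment, target])

    # Step 2 (assumed rule): visit the remaining complete rows from nearest to
    # farthest and keep a row only if the correlation does not decrease.
    for i in order[base_size:]:
        trial = segment + [i]
        c = abs_corr(complete[trial, pred], complete[trial, target])
        if c >= best:
            segment, best = trial, c

    # Step 3: fit a one-predictor linear model on the segment and predict.
    slope, intercept = np.polyfit(complete[segment, pred],
                                  complete[segment, target], 1)
    return slope * row[pred] + intercept

# Usage on toy data where feature 1 is a linear function of feature 0.
x = np.arange(20, dtype=float)
toy = np.column_stack([x, 2 * x + 1, x % 3])
estimate = impute_one(toy, np.array([7.5, np.nan, 0.0]), target=1)
```

Growing the segment greedily keeps the sketch short, but the result depends on the order in which rows are visited; the paper's own segment-generation procedures may differ substantially.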
Pages: 30
Related papers
50 in total
  • [1] A novel model to optimize multiple imputation algorithm for missing data using evolution methods
    Mohammed, Yasser Salaheldin
    Abdelkader, Hatem
    Plawiak, Pawel
    Hammad, Mohamed
    BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2022, 76
  • [2] Handling incomplete and missing data in water network database using imputation methods
    Kabir, Golam
    Tesfamariam, Solomon
    Hemsing, Jordi
    Sadiq, Rehan
    SUSTAINABLE AND RESILIENT INFRASTRUCTURE, 2020, 5 (06) : 365 - 377
  • [3] Evaluation of Statistical Methods for Estimating Missing Daily Streamflow Data
    Yilmaz, Mustafa Utku
    Onoz, Bihrat
    TEKNIK DERGI, 2019, 30 (06) : 9597 - 9620
  • [4] Estimating missing reference evapotranspiration data by correlation analysis
    Eching, SO
    PROCEEDINGS OF THE IVTH INTERNATIONAL SYMPOSIUM ON IRRIGATION OF HORTICULTURAL CROPS, 2004, (664) : 181 - 187
  • [5] A Review of Missing Data Handling Methods in Education Research
    Cheema, Jehanzeb R.
    REVIEW OF EDUCATIONAL RESEARCH, 2014, 84 (04) : 487 - 508
  • [6] NMVI: A data-splitting based imputation technique for distinct types of missing data
    Bhagat, Hutashan Vishal
    Singh, Manminder
    CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2022, 223
  • [7] Estimating propensity scores with missing covariate data using general location mixture models
    Mitra, Robin
    Reiter, Jerome P.
    STATISTICS IN MEDICINE, 2011, 30 (06) : 627 - 641
  • [8] Simple methods to handle missing data
    Bici, Ruzhdie
    INTERNATIONAL JOURNAL OF COMPUTATIONAL ECONOMICS AND ECONOMETRICS, 2023, 13 (02) : 216 - 242
  • [9] An Integrated Fuzzy C-Means Method for Missing Data Imputation Using Taxi GPS Data
    Huang, Junsheng
    Mao, Baohua
    Bai, Yun
    Zhang, Tong
    Miao, Changjun
    SENSORS, 2020, 20 (07)
  • [10] Imputation methods of missing data for estimating the population mean using simple random sampling with known correlation coefficient
    Al-Omari, Amer Ibrahim
    Bouza, Carlos N.
    Herrera, Carmelo
    QUALITY & QUANTITY, 2013, 47 (01) : 353 - 365