Machine Learning Based Missing Data Imputation in Categorical Datasets

Cited by: 1
Authors
Ishaq, Muhammad [1]
Zahir, Sana [1]
Iftikhar, Laila [1]
Bulbul, Mohammad Farhad [2]
Rho, Seungmin [3]
Lee, Mi Young [4]
Affiliations
[1] Univ Agr Peshawar, Inst Comp Sci & Informat Technol, Peshawar 25000, Khyber Pakhtunk, Pakistan
[2] Jashore Univ Sci & Technol, Dept Math, Jashore 7408, Bangladesh
[3] Chung Ang Univ, Dept Ind Secur, Seoul 06974, South Korea
[4] Chung Ang Univ, Dept Res, Seoul 06974, South Korea
Source
IEEE ACCESS | 2024 / Vol. 12
Funding
National Research Foundation, Singapore;
Keywords
Data cleansing; missing data imputation; classification; regression and categorical datasets; multiple imputation
DOI
10.1109/ACCESS.2024.3411817
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This research investigated the use of machine learning algorithms to predict and fill in missing values in categorical datasets. The emphasis was on ensemble models constructed using the Error-Correcting Output Codes (ECOC) framework, including models based on SVM and KNN as well as a hybrid classifier that combines models based on SVM, KNN, and MLP. Three diverse datasets (CPU, Hypothyroid, and Breast Cancer) were used to validate these algorithms. Results indicated that these machine learning techniques performed well in predicting and completing missing data, with effectiveness varying by dataset and missing data pattern. Compared to single models, ensemble models built on the ECOC framework significantly improved prediction accuracy and robustness. Despite these encouraging results, deep learning for missing data imputation faces obstacles, including the need for large amounts of labeled data and the risk of overfitting. Future research should evaluate the feasibility and efficacy of deep learning algorithms for missing data imputation.
Pages: 88332-88344
Number of pages: 13
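
The abstract describes imputing missing categorical values with ensembles built on the ECOC framework. The following is a minimal sketch of that general idea, not the authors' implementation: each observed category gets a random binary codeword, one binary learner predicts each bit, and the missing value is decoded as the category whose codeword is closest to the predicted bits. The function names (`ecoc_impute`, `make_codebook`, `knn_vote`), the random codebook, the Hamming-distance KNN base learner, and the toy rows are all assumptions made for illustration; the record does not give the paper's datasets, code design, or hyperparameters.

```python
import random
from collections import Counter

def hamming(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_vote(train_X, train_labels, x, k):
    """Majority label among the k training rows nearest to x (Hamming distance)."""
    order = sorted(range(len(train_X)), key=lambda i: hamming(train_X[i], x))
    return Counter(train_labels[i] for i in order[:k]).most_common(1)[0][0]

def make_codebook(classes, n_bits, rng):
    """Draw random codewords until they are distinct and every bit separates some classes."""
    if len(classes) < 2 or len(classes) > 2 ** n_bits:
        raise ValueError("need at least 2 classes and enough bits to tell them apart")
    while True:
        book = {c: tuple(rng.randint(0, 1) for _ in range(n_bits)) for c in classes}
        distinct = len(set(book.values())) == len(classes)
        splits = all(len({code[b] for code in book.values()}) == 2
                     for b in range(n_bits))
        if distinct and splits:
            return book

def ecoc_impute(rows, target_col, n_bits=6, k=3, seed=0):
    """Fill None entries in target_col; assumes all other columns are complete."""
    def features(row):
        return tuple(v for i, v in enumerate(row) if i != target_col)

    known = [r for r in rows if r[target_col] is not None]
    train_X = [features(r) for r in known]
    train_y = [r[target_col] for r in known]
    book = make_codebook(sorted(set(train_y)), n_bits, random.Random(seed))

    filled = []
    for row in rows:
        if row[target_col] is not None:
            filled.append(row)
            continue
        x = features(row)
        # One binary prediction per codeword bit.
        bits = tuple(
            knn_vote(train_X, [book[y][b] for y in train_y], x, k)
            for b in range(n_bits)
        )
        # Decode: pick the class whose codeword is nearest to the predicted bits.
        label = min(book, key=lambda c: hamming(book[c], bits))
        filled.append(row[:target_col] + (label,) + row[target_col + 1:])
    return filled

# Hypothetical toy data: the last two rows have a missing category.
rows = [
    ("red", "round", "apple"), ("red", "round", "apple"), ("red", "round", "apple"),
    ("yellow", "long", "banana"), ("yellow", "long", "banana"), ("yellow", "long", "banana"),
    ("green", "oval", "kiwi"), ("green", "oval", "kiwi"), ("green", "oval", "kiwi"),
    ("red", "round", None),
    ("yellow", "long", None),
]
filled = ecoc_impute(rows, target_col=2)
```

In this sketch the base learner could be swapped for any binary classifier, such as an SVM or an MLP, which is where a hybrid ensemble of the kind the abstract mentions would plug in.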