Data Imputation for Symbolic Regression with Missing Values: A Comparative Study

Cited by: 0
Authors
Al-Helali, Baligh [1]
Chen, Qi [1]
Xue, Bing [1]
Zhang, Mengjie [1]
Affiliations
[1] Victoria Univ Wellington, Sch Engn & Comp Sci, POB 600, Wellington 6140, New Zealand
Source
2020 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI) | 2020
Keywords
symbolic regression; genetic programming; incomplete data; imputation; predictor selection
DOI
10.1109/ssci47803.2020.9308216
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Symbolic regression via genetic programming is considered a crucial machine learning tool for empirical modelling. However, real-world data sets commonly have data quality problems such as noise, outliers, and missing values. Although several approaches can be adopted to deal with data incompleteness in machine learning, most studies consider classification tasks, and only a few have considered symbolic regression with missing values. In this work, the performance of symbolic regression using genetic programming on real-world data sets that have missing values is investigated. This is done by studying how different imputation methods affect symbolic regression performance. The experiments are conducted using thirteen real-world incomplete data sets with different ratios of missing values. The experimental results show that although the performance of the imputation methods differs with the data set, CART has a better effect than the others. This might be due to its ability to deal with both categorical and numerical variables. Moreover, imputation methods are observed to outperform the commonly used deletion strategy.
Pages: 2093-2100
Page count: 8