Marriage between variable selection and prediction methods to model plant disease risk

Cited by: 1
Authors
Suarez, Franco [1]
Bruno, Cecilia [1,2,4]
Giannini, Franca Kurina
Pecci, M. Paz Gimenez
Pardina, Patricia Rodriguez [1,3]
Balzarini, Monica
Affiliations
[1] Consejo Nacl Invest Cient & Tecn, Unidad Fitopatol & Modelizac Agr UFyMA, INTA, Ing Agr Felix Aldo Marrone 746,Ciudad Univ, RA-5000 Cordoba, Argentina
[2] Univ Nacl Cordoba, Fac Ciencias Agr, Catedra Estadist & Biometria, Cordoba, Argentina
[3] Inst Nacl Tecnol Agr, Inst Patol Vegetal INTA IPAVE, Cordoba, Argentina
[4] Univ Nacl Cordoba, Fac Ciencias Agr Estadist & Biometria, Grp Vinculado Unidad Fitopatol & Modelizac Agr UFy, INTA CONICET, Ing Agr Aldo Marrone 746,Ciudad Universitaria,Ofic, RA-5000 Cordoba, Argentina
Keywords
Logistic regression; Random forest; Feature selection; Prediction models; Multicollinearity; Pathosystems; PARTIAL LEAST-SQUARES; REGRESSION
DOI
10.1016/j.eja.2023.126995
Chinese Library Classification (CLC) number
S3 [Agronomy]
Subject classification code
0901
Abstract
Predicting the risk of a disease in a pathosystem from a set of climatic variables usually requires handling a large number of input variables, many of which are irrelevant and/or redundant. Building linear predictive models entails not only dimensionality issues but also the negative impact of multicollinearity. Several feature selection methods have proved to be efficient in both linear and non-linear models, regardless of those issues. However, in a machine learning (ML) context, these feature selection methods need to be evaluated when embedded into the model fitting algorithm in order to obtain the greatest accuracy. The aim of this work was to assess different combinations of variable selection methods with linear and non-linear predictors to fit climate-based models that predict the occurrence of a disease in a pathosystem. Four selection methods were compared: stepwise selection, which is frequently used in linear models, combined with variance inflation factor (VIF) and p-value statistical criteria (Step+VIF+Pv), and three methods commonly used in ML: filter (F), genetic algorithm (GA), and Boruta (B). The disease risk predictors were constructed with a logistic linear regression model (LR) and the random forest (RF) algorithm, using all the available variables and the subsets of variables selected by each feature selection method. Data from three pathosystems were processed: two involving Begomovirus, one in common bean (Phaseolus vulgaris L.) and the other in soybean (Glycine max), and a third involving Mal de Rio Cuarto virus in maize (Zea mays L.). The data sets differed in sample size and number of variables. The accuracy of RF prediction did not vary among feature selection methods. Step+VIF+Pv, used to reduce the model, outperformed the other feature selection methods in fitting LR. Our proposal suggests that the appropriate pairing of variable selection and prediction models would improve the modeling of plant disease risk.
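To illustrate the VIF criterion named in the abstract, here is a minimal sketch. It is not the authors' procedure: it covers only VIF-based pruning (no stepwise search or p-value test), the threshold of 10 is a common rule of thumb assumed here, and the function names, variable names, and synthetic data are hypothetical. It relies on the standard identity that the VIF of predictor j equals the j-th diagonal element of the inverse of the predictors' correlation matrix.

```python
import numpy as np


def vif(X):
    """Return the variance inflation factor of each column of X.

    Uses the identity VIF_j = [R^-1]_jj, where R is the correlation
    matrix of the predictors. Requires at least two columns.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def prune_by_vif(X, names, threshold=10.0):
    """Drop the highest-VIF column until every VIF is <= threshold.

    The threshold of 10 is an assumed rule of thumb, not a value
    taken from the paper.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        values = vif(X[:, keep])
        worst = int(np.argmax(values))
        if values[worst] <= threshold:
            break
        keep.pop(worst)
    return [names[i] for i in keep]


# Synthetic climatic predictors (hypothetical names and values).
rng = np.random.default_rng(0)
n = 200
temp_mean = rng.normal(20, 5, n)
humidity = rng.normal(60, 10, n)
rain = rng.normal(5, 2, n)
temp_max = temp_mean + rng.normal(0, 0.1, n)  # nearly collinear with temp_mean

X = np.column_stack([temp_mean, humidity, rain, temp_max])
names = ["temp_mean", "humidity", "rain", "temp_max"]

selected = prune_by_vif(X, names)
print(selected)
```

In this toy data set the two temperature columns are almost identical, so one of them is removed and the remaining predictors have VIFs close to 1. The reduced set could then be passed to a logistic regression, which is the setting in which the abstract reports Step+VIF+Pv performing best.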
Pages: 12