Marriage between variable selection and prediction methods to model plant disease risk

Cited by: 1
Authors
Suarez, Franco [1]
Bruno, Cecilia [1,2,4]
Giannini, Franca Kurina
Pecci, M. Paz Gimenez
Pardina, Patricia Rodriguez [1,3]
Balzarini, Monica
Affiliations
[1] Consejo Nacl Invest Cient & Tecn, Unidad Fitopatol & Modelizac Agr UFyMA, INTA, Ing Agr Felix Aldo Marrone 746,Ciudad Univ, RA-5000 Cordoba, Argentina
[2] Univ Nacl Cordoba, Fac Ciencias Agr, Catedra Estadist & Biometria, Cordoba, Argentina
[3] Inst Nacl Tecnol Agr, Inst Patol Vegetal INTA IPAVE, Cordoba, Argentina
[4] Univ Nacl Cordoba, Fac Ciencias Agr Estadist & Biometria, Grp Vinculado Unidad Fitopatol & Modelizac Agr UFy, INTA CONICET, Ing Agr Aldo Marrone 746,Ciudad Universitaria,Ofic, RA-5000 Cordoba, Argentina
Keywords
Logistic regression; Random forest; Feature selection; Prediction models; Multicollinearity; Pathosystems; PARTIAL LEAST-SQUARES; REGRESSION
DOI
10.1016/j.eja.2023.126995
Chinese Library Classification number
S3 [Agricultural science (agronomy)]
Subject classification code
0901
Abstract
Predicting the risk of a disease in a pathosystem based on a set of climatic variables usually requires handling a large number of input variables, many of which are often irrelevant and/or redundant. Building linear predictive models entails not only dimensionality issues but also the negative impact of multicollinearity. Several feature selection methods have proved to be efficient in both linear and non-linear models, regardless of those issues. However, in a machine learning (ML) context, it is necessary to evaluate these feature selection methods embedded into the model fitting algorithm to obtain the greatest accuracy. The aim of this work was to assess different combinations of variable selection methods with linear and non-linear predictors to fit climate-based models that predict the occurrence of a disease in a pathosystem. Four selection methods were compared: stepwise, which is frequently used in linear models, combined with VIF and p-value statistical criteria (Step+VIF+Pv), and three methods commonly used in ML: filter (F), genetic algorithm (GA), and Boruta (B). The disease risk predictors were constructed with a logistic linear regression model (LR) and the random forest (RF) algorithm, using all the available variables and the subgroups of variables selected by each feature selection method. Data from three pathosystems were processed: two involving Begomovirus, one in common bean (Phaseolus vulgaris L.) and the other in soybean (Glycine max), and a third involving Mal de Rio Cuarto virus in maize (Zea mays L.). The data sets differed in sample size and number of variables. The accuracy of RF prediction did not vary among feature selection methods. When used to reduce the model, Step+VIF+Pv outperformed the other feature selection methods in fitting LR. Our proposal suggests that the appropriate pairing of variable selection and prediction models would improve the modeling of plant disease risk.
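The VIF criterion mentioned in the abstract can be illustrated with a minimal sketch. It covers only a VIF screening step, not the stepwise search or the p-value criterion, and it is not the authors' implementation. The variable names (temp, humidity, dew_point, rain), the synthetic data, and the threshold of 10 (a common rule of thumb) are all assumptions made for illustration. The sketch relies on the identity that the VIF of each variable equals the corresponding diagonal element of the inverse correlation matrix, and uses only the Python standard library.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF per variable: the diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}


def prune_by_vif(columns, threshold=10.0):
    """Drop the variable with the highest VIF until all VIFs are <= threshold."""
    kept = dict(columns)
    while len(kept) > 1:
        scores = vif(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return kept


# Synthetic, hypothetical climatic variables: dew_point is almost a linear
# combination of temp and humidity, so the three are strongly collinear.
rng = random.Random(0)
n = 200
temp = [rng.gauss(20, 5) for _ in range(n)]
humidity = [rng.gauss(60, 10) for _ in range(n)]
dew_point = [0.6 * t + 0.4 * h + rng.gauss(0, 0.1) for t, h in zip(temp, humidity)]
rain = [rng.gauss(3, 2) for _ in range(n)]
data = {"temp": temp, "humidity": humidity, "dew_point": dew_point, "rain": rain}

kept = prune_by_vif(data)
print(sorted(kept))
print({name: round(score, 2) for name, score in vif(kept).items()})
```

In a pipeline like the one the abstract describes, the variables that survive this screening would then be passed to the predictor (LR or RF); that step is omitted here.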
Pages: 12
Related papers
50 in total
  • [21] A methodology for comparing classification methods through the assessment of model stability and validity in variable selection
    Shreve, J.
    Schneider, H.
    Soysal, O.
    DECISION SUPPORT SYSTEMS, 2011, 52 (01) : 247 - 257
  • [22] Development and evaluation of a chronic kidney disease risk prediction model using random forest
    Mendapara, Krish
    FRONTIERS IN GENETICS, 2024, 15
  • [23] A Benchmark Feature Selection Framework for Non Communicable Disease Prediction Model
    Sutanto, Daniel Hartono
    Abd Ghani, Mohd. Khanapi
    ADVANCED SCIENCE LETTERS, 2015, 21 (10) : 3409 - 3416
  • [24] Comparison Between Linear and Non-linear Variable Selection Methods with Applications to Spectroscopic (UV-Vis/NIR) Data
    Krongchai, Chanida
    Wongsaipun, Sakunna
    Funsueb, Sujitra
    Theanjumpol, Parichat
    Jakmunee, Jaroon
    Kittiwachana, Sila
    CHIANG MAI JOURNAL OF SCIENCE, 2020, 47 (01) : 160 - 174
  • [25] Methods for updating a risk prediction model for cardiac surgery: a statistical primer
    Siregar, Sabrina
    Nieboer, Daan
    Versteegh, Michel I. M.
    Steyerberg, Ewout W.
    Takkenberg, Johanna J. M.
    INTERACTIVE CARDIOVASCULAR AND THORACIC SURGERY, 2019, 28 (03) : 333 - 338
  • [26] Prediction of Placental Barrier Permeability: A Model Based on Partial Least Squares Variable Selection Procedure
    Zhang, Yong-Hong
    Xia, Zhi-Ning
    Yan, Li
    Liu, Shu-Shen
    MOLECULES, 2015, 20 (05) : 8270 - 8286
  • [27] An Empirical Study on New Model-Free Multi-output Variable Selection Methods
    Ansari, Jonathan
    Luetkebohmert, Eva
    Rockel, Marcus
    COMBINING, MODELLING AND ANALYZING IMPRECISION, RANDOMNESS AND DEPENDENCE, SMPS 2024, 2024, 1458 : 9 - 17
  • [28] Prediction of internal egg quality characteristics and variable selection using regularization methods: ridge, LASSO and elastic net
    Ciftsuren, Mehmet Nur
    Akkol, Suna
    ARCHIVES ANIMAL BREEDING, 2018, 61 (03) : 279 - 284
  • [29] Novel hybrid methods applied for spatial prediction of mercury and variable selection of trace elements in coastal areas of USA
    Sakizadeh, Mohammad
    MARINE POLLUTION BULLETIN, 2020, 150
  • [30] Feature selection and risk prediction for patients with coronary artery disease using data mining
    Nashreen Md Idris
    Yin Kia Chiam
    Kasturi Dewi Varathan
    Wan Azman Wan Ahmad
    Kok Han Chee
    Yih Miin Liew
    Medical & Biological Engineering & Computing, 2020, 58 : 3123 - 3140