Marriage between variable selection and prediction methods to model plant disease risk

Cited by: 1
Authors
Suarez, Franco [1]
Bruno, Cecilia [1,2,4]
Giannini, Franca Kurina
Pecci, M. Paz Gimenez
Pardina, Patricia Rodriguez [1,3]
Balzarini, Monica
Affiliations
[1] Consejo Nacl Invest Cient & Tecn, Unidad Fitopatol & Modelizac Agr UFyMA, INTA, Ing Agr Felix Aldo Marrone 746,Ciudad Univ, RA-5000 Cordoba, Argentina
[2] Univ Nacl Cordoba, Fac Ciencias Agr, Catedra Estadist & Biometria, Cordoba, Argentina
[3] Inst Nacl Tecnol Agr, Inst Patol Vegetal INTA IPAVE, Cordoba, Argentina
[4] Univ Nacl Cordoba, Fac Ciencias Agr Estadist & Biometria, Grp Vinculado Unidad Fitopatol & Modelizac Agr UFy, INTA CONICET, Ing Agr Aldo Marrone 746,Ciudad Universitaria,Ofic, RA-5000 Cordoba, Argentina
Keywords
Logistic regression; Random forest; Feature selection; Prediction models; Multicollinearity; Pathosystems; PARTIAL LEAST-SQUARES; REGRESSION;
DOI
10.1016/j.eja.2023.126995
Chinese Library Classification number
S3 [Agricultural science (agronomy)]
Discipline classification code
0901
Abstract
Predicting the risk of a disease in a pathosystem based on a set of climatic variables usually requires handling a high number of input variables, many of which are often irrelevant and/or redundant. Building linear predictive models entails not only dimensionality issues but also the negative impact of multicollinearity. Several feature selection methods have proved to be efficient in both linear and non-linear models, regardless of those issues. However, in a machine learning (ML) context, it is necessary to evaluate these feature selection methods embedded into the model fitting algorithm to obtain the greatest accuracy. The aim of this work was to assess different combinations of variable selection methods with linear and non-linear predictors to fit climate-based models that predict the occurrence of a disease in a pathosystem. Four selection methods were compared: stepwise, which is frequently used in linear models, combined with VIF and p-value statistical criteria (Step+VIF+Pv), and three methods commonly used in ML: filter (F), genetic algorithm (GA), and Boruta (B). The disease risk predictors were constructed with a logistic linear regression model (LR) and the random forest (RF) algorithm, using all the available variables and the subgroups of variables selected by each feature selection method. Data from three pathosystems were processed: two involving Begomovirus (one in common bean, Phaseolus vulgaris L., and the other in soybean, Glycine max) and a third involving Mal de Rio Cuarto virus in maize (Zea mays L.). The data sets differed in sample size and number of variables. The accuracy of RF prediction did not vary among feature selection methods. Step+VIF+Pv outperformed the other feature selection methods in fitting LR. Our proposal suggests that the appropriate pairing of variable selection and prediction models would improve the modeling of plant disease risk.
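As a rough illustration of the multicollinearity screening mentioned in the abstract, the sketch below covers only the VIF part of a Step+VIF+Pv-style procedure: it repeatedly drops the variable with the largest variance inflation factor until all remaining VIFs fall under a threshold. It is not the authors' implementation. The variable names, the synthetic data, and the threshold of 10 (a common rule of thumb) are all assumptions made for the example, and the stepwise and p-value steps are omitted.

```python
import random

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [sum(v * v for v in c) ** 0.5 for c in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[i][i] for i in range(len(columns))]

def prune_by_vif(data, threshold=10.0):
    """Drop the highest-VIF variable until every remaining VIF is <= threshold."""
    kept = dict(data)
    while len(kept) > 1:
        names = list(kept)
        scores = vif([kept[k] for k in names])
        worst = max(range(len(names)), key=scores.__getitem__)
        if scores[worst] <= threshold:
            break
        del kept[names[worst]]
    return list(kept)

# Synthetic, hypothetical climatic variables: temp_max is nearly a linear
# function of temp_mean, so the two are strongly collinear by construction.
random.seed(0)
n_obs = 200
temp_mean = [random.gauss(20, 3) for _ in range(n_obs)]
data = {
    "temp_mean": temp_mean,
    "temp_max": [t + 5 + random.gauss(0, 0.5) for t in temp_mean],
    "rainfall": [random.gauss(80, 25) for _ in range(n_obs)],
    "humidity": [random.gauss(65, 10) for _ in range(n_obs)],
}

selected = prune_by_vif(data)
print("Variables retained after VIF screening:", selected)
```

The sketch uses only the standard library so that it runs anywhere. In practice a maintained routine such as `variance_inflation_factor` from statsmodels would be the more usual choice, and the retained variables would then be passed on to the LR or RF fit.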
Pages: 12
Related papers
50 in total
  • [31] Feature selection and risk prediction for patients with coronary artery disease using data mining
    Md Idris, Nashreen
    Chiam, Yin Kia
    Varathan, Kasturi Dewi
    Wan Ahmad, Wan Azman
    Chee, Kok Han
    Liew, Yih Miin
    MEDICAL & BIOLOGICAL ENGINEERING & COMPUTING, 2020, 58 (12) : 3123 - 3140
  • [32] Splice sites prediction of Human genome using length-variable Markov model and feature selection
    Zhang, Quanwei
    Peng, Qinke
    Zhang, Qi
    Yan, Yanhua
    Li, Kankan
    Li, Jing
    EXPERT SYSTEMS WITH APPLICATIONS, 2010, 37 (04) : 2771 - 2782
  • [33] Uncertainty in Propensity Score Estimation: Bayesian Methods for Variable Selection and Model-Averaged Causal Effects
    Zigler, Corwin Matthew
    Dominici, Francesca
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2014, 109 (505) : 95 - 107
  • [34] Logistic Regression Model Based on Ultrafast Pulse Wave Velocity and Different Feature Selection Methods to Predict the Risk of Hypertension
    Bai, Xue
    Liu, Wenjun
    Huang, Hui
    You, Huan
    IRANIAN JOURNAL OF PUBLIC HEALTH, 2022, 51 (09) : 2099 - 2107
  • [35] Effectiveness of Shrinkage and Variable Selection Methods for the Prediction of Complex Human Traits using Data from Distantly Related Individuals
    Berger, Swetlana
    Perez-Rodriguez, Paulino
    Veturi, Yogasudha
    Simianer, Henner
    de los Campos, Gustavo
    ANNALS OF HUMAN GENETICS, 2015, 79 (02) : 122 - 135
  • [36] Construction of a disease risk prediction model for postherpetic pruritus by machine learning
    Lin, Zheng
    Dou, Yuan
    Ju, Ru-yi
    Lin, Ping
    Cao, Yi
    FRONTIERS IN MEDICINE, 2024, 11
  • [37] QEFS: A novel plant disease prediction approach using quantum-inspired evolutionary feature selection
    Anand, Khushi
    Jain, Bhawna
    Mittal, Himanshu
    Yadav, Vijay Kumar
    APPLIED INTELLIGENCE, 2025, 55 (02)
  • [38] Risk Prediction Model for Crohn's Disease Based on Hematological Indicators
    Zeng, Tao
    Xiao, Yushan
    Huang, Zhongchao
    Wang, Xianghui
    Hu, Suhua
    Huang, Jiahui
    Liu, Huanliang
    CLINICAL LABORATORY, 2023, 69 (07) : 1434 - 1442
  • [39] Classification model for heart disease prediction with feature selection through modified bee algorithm
    Velswamy, Karunakaran
    Velswamy, Rajasekar
    Swamidason, Iwin Thanakumar Joseph
    Chinnaiyan, Selvan
    SOFT COMPUTING, 2022, 26 (23) : 13049 - 13057
  • [40] Classification model for heart disease prediction with feature selection through modified bee algorithm
    Karunakaran Velswamy
    Rajasekar Velswamy
    Iwin Thanakumar Joseph Swamidason
    Selvan Chinnaiyan
    Soft Computing, 2022, 26 : 13049 - 13057