Marriage between variable selection and prediction methods to model plant disease risk

Cited by: 1
Authors
Suarez, Franco [1]
Bruno, Cecilia [1, 2, 4]
Giannini, Franca Kurina
Pecci, M. Paz Gimenez
Pardina, Patricia Rodriguez [1, 3]
Balzarini, Monica
Affiliations
[1] Consejo Nacl Invest Cient & Tecn, Unidad Fitopatol & Modelizac Agr UFyMA, INTA, Ing Agr Felix Aldo Marrone 746,Ciudad Univ, RA-5000 Cordoba, Argentina
[2] Univ Nacl Cordoba, Fac Ciencias Agr, Catedra Estadist & Biometria, Cordoba, Argentina
[3] Inst Nacl Tecnol Agr, Inst Patol Vegetal INTA IPAVE, Cordoba, Argentina
[4] Univ Nacl Cordoba, Fac Ciencias Agr Estadist & Biometria, Grp Vinculado Unidad Fitopatol & Modelizac Agr UFy, INTA CONICET, Ing Agr Aldo Marrone 746,Ciudad Universitaria,Ofic, RA-5000 Cordoba, Argentina
Keywords
Logistic regression; Random forest; Feature selection; Prediction models; Multicollinearity; Pathosystems; PARTIAL LEAST-SQUARES; REGRESSION;
DOI
10.1016/j.eja.2023.126995
Chinese Library Classification
S3 [Agricultural sciences (Agronomy)]
Discipline classification code
0901
Abstract
Predicting the risk of a disease in a pathosystem based on a set of climatic variables usually requires handling a high number of input variables, many of which are often irrelevant and/or redundant. Building linear predictive models entails not only dimensionality issues but also the negative impact of multicollinearity. Several feature selection methods have proved to be efficient in both linear and non-linear models, regardless of those issues. However, in a machine learning (ML) context, it is necessary to evaluate these feature selection methods embedded into the model fitting algorithm to obtain the greatest accuracy. The aim of this work was to assess different combinations of variable selection methods with linear and non-linear predictors to fit climate-based models that predict the occurrence of a disease in a pathosystem. Four selection methods were compared: stepwise, which is frequently used in linear models, combined with VIF and p-value statistical criteria (Step+VIF+Pv), and three methods commonly used in ML: filter (F), genetic algorithm (GA), and Boruta (B). The disease risk predictors were constructed with a logistic linear regression model (LR) and the random forest (RF) algorithm, using all the available variables and the subgroups of variables selected by each feature selection method. Data from three pathosystems were processed: two involving Begomovirus, one in common bean (Phaseolus vulgaris L.) and the other in soybean (Glycine max), and a third involving Mal de Rio Cuarto virus in maize (Zea mays L.). The data sets differed in sample size and number of variables. The accuracy of RF prediction did not vary among feature selection methods. Step+VIF+Pv, used to reduce the model, outperformed the other feature selection methods in fitting LR. Our proposal suggests that the appropriate pairing of variable selection and prediction models would improve the modeling of plant disease risk.
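The multicollinearity screen named in the abstract (the VIF criterion) can be illustrated with a minimal sketch. This is not the authors' procedure: it covers only a VIF-based elimination step, not the stepwise search or the p-value criterion, and the threshold of 10, the function names `vif` and `drop_collinear`, and the synthetic variables `temp_mean`, `temp_max`, and `humidity` are all assumptions made for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n_samples x n_features).

    VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing column j
    on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

def drop_collinear(X, names, threshold=10.0):
    """Repeatedly drop the predictor with the largest VIF until all are <= threshold.

    The threshold of 10 is a common rule of thumb, assumed here; it is not
    taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        v = vif(X[:, keep])
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        keep.pop(worst)
    return [names[i] for i in keep]

# Synthetic, hypothetical climatic predictors: two near-duplicate temperature
# variables and one independent humidity variable.
rng = np.random.default_rng(0)
temp_mean = rng.normal(size=200)
temp_max = temp_mean + 0.01 * rng.normal(size=200)
humidity = rng.normal(size=200)

X = np.column_stack([temp_mean, temp_max, humidity])
names = ["temp_mean", "temp_max", "humidity"]

selected = drop_collinear(X, names)
print(selected)
```

In this toy case one of the two temperature variables is removed and humidity is retained. The surviving columns could then be passed to a logistic regression or a random forest, which is the kind of selection-plus-predictor pairing the abstract compares.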
Pages: 12
Related papers
50 items in total
  • [41] A novel variable selection method based on frequent pattern tree for real-time traffic accident risk prediction
    Lin, Lei
    Wang, Qian
    Sadek, Adel W.
    TRANSPORTATION RESEARCH PART C-EMERGING TECHNOLOGIES, 2015, 55 : 444 - 459
  • [42] Comparison of variable and model selection methods for genetic association studies using the GAW15 simulated data
    Zhan Ye
    Elizabeth J Atkinson
    Brooke L Fridley
    Mariza de Andrade
    BMC Proceedings, 1 (Suppl 1)
  • [43] Towards Optimal Variable Selection Methods for Soil Property Prediction Using a Regional Soil Vis-NIR Spectral Library
    Zhang, Xianglin
    Xue, Jie
    Xiao, Yi
    Shi, Zhou
    Chen, Songchao
    REMOTE SENSING, 2023, 15 (02)
  • [44] A Non-asymptotic Risk Bound for Model Selection in a High-Dimensional Mixture of Experts via Joint Rank and Variable Selection
    TrungTin Nguyen
    Dung Ngoc Nguyen
    Hien Duy Nguyen
    Chamroukhi, Faicel
    ADVANCES IN ARTIFICIAL INTELLIGENCE, AI 2023, PT II, 2024, 14472 : 234 - 245
  • [45] Development of a general logistic model for disease risk prediction using multiple SNPs
    Long, Cheng
    Lv, Guanting
    Fu, Xinmiao
    FEBS OPEN BIO, 2019, 9 (11): 2006 - 2012
  • [46] A Crop Harvest Time Prediction Model for Better Sustainability, Integrating Feature Selection and Artificial Intelligence Methods
    Liu, Shu-Chu
    Jian, Quan-Ying
    Wen, Hsien-Yin
    Chung, Chih-Hung
    SUSTAINABILITY, 2022, 14 (21)
  • [47] The impact of different imputation methods on estimates and model performance: an example using a risk prediction model for premature mortality
    Hurst, Mackenzie
    O'Neill, Meghan
    Pagalan, Lief
    Diemert, Lori M.
    Rosella, Laura C.
    POPULATION HEALTH METRICS, 2024, 22 (01)
  • [48] Heart Disease Prediction Model Using Feature Selection and Ensemble Deep Learning with Optimized Weight
    Al-Mahdi, Iman S.
    Darwish, Saad M.
    Madbouly, Magda M.
    CMES-COMPUTER MODELING IN ENGINEERING & SCIENCES, 2025,
  • [49] NeuroPpred-Fuse: an interpretable stacking model for prediction of neuropeptides by fusing sequence information and feature selection methods
    Jiang, Mingming
    Zhao, Bowen
    Luo, Shenggan
    Wang, Qiankun
    Chu, Yanyi
    Chen, Tianhang
    Mao, Xueying
    Liu, Yatong
    Wang, Yanjing
    Jiang, Xue
    Wei, Dong-Qing
    Xiong, Yi
    BRIEFINGS IN BIOINFORMATICS, 2021, 22 (06)
  • [50] Variable selection for high-dimensional partly linear additive Cox model with application to Alzheimer's disease
    Wu, Qiwei
    Zhao, Hui
    Zhu, Liang
    Sun, Jianguo
    STATISTICS IN MEDICINE, 2020, 39 (23) : 3120 - 3134