XGBoost-Based Framework for Smoking-Induced Noncommunicable Disease Prediction

Cited by: 58
Authors
Davagdorj, Khishigsuren [1]
Van Huy Pham [2]
Theera-Umpon, Nipon [3, 4]
Ryu, Keun Ho [2, 4]
Affiliations
[1] Chungbuk Natl Univ, Coll Elect & Comp Engn, Database & Bioinformat Lab, Cheongju 28644, South Korea
[2] Ton Duc Thang Univ, Fac Informat Technol, Ho Chi Minh City 700000, Vietnam
[3] Chiang Mai Univ, Fac Engn, Dept Elect Engn, Chiang Mai 50200, Thailand
[4] Chiang Mai Univ, Biomed Engn Inst, Chiang Mai 50200, Thailand
Funding
National Research Foundation of Singapore
Keywords
smoking; noncommunicable disease; feature selection; extreme gradient boosting; classification; regression; diagnosis; selection
DOI
10.3390/ijerph17186513
Chinese Library Classification number
X [Environmental Science, Safety Science]
Discipline classification code
08; 0830
Abstract
Smoking-induced noncommunicable diseases (SiNCDs) have become a significant threat to public health and a cause of death globally. In the last decade, numerous studies have used artificial intelligence techniques to predict the risk of developing SiNCDs. However, determining the most significant features and developing interpretable models remain challenging in such systems. In this study, we propose an efficient extreme gradient boosting (XGBoost) based framework combined with a hybrid feature selection (HFS) method for SiNCD prediction among the general population in South Korea and the United States. First, HFS is performed in three stages: (I) significant features are selected by t-test and chi-square test; (II) multicollinearity analysis is used to obtain dissimilar features; (III) the final selection of the most representative features is based on the least absolute shrinkage and selection operator (LASSO). The selected features are then fed into the XGBoost predictive model. The experimental results show that the proposed model outperforms several existing baseline models. In addition, the proposed model identifies important features, which enhances the interpretability of SiNCD prediction. Consequently, the XGBoost-based framework is expected to contribute to the early diagnosis and prevention of SiNCDs in public health.
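The three HFS stages and the final boosting step described in the abstract can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' implementation. The function name `hybrid_feature_selection`, the thresholds (`alpha`, `max_corr`, `lasso_alpha`), the use of pairwise Pearson correlation for the multicollinearity stage, and the use of a linear LASSO on the 0/1 label are all assumptions made for the example. If the `xgboost` package is not installed, the sketch falls back to scikit-learn's `GradientBoostingClassifier` purely as a stand-in.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

try:
    from xgboost import XGBClassifier as Booster  # the model named in the abstract
except ImportError:
    # Stand-in only, so the sketch still runs without xgboost installed.
    from sklearn.ensemble import GradientBoostingClassifier as Booster


def hybrid_feature_selection(X, y, categorical, alpha=0.05, max_corr=0.8,
                             lasso_alpha=0.02):
    """Return the column indices kept after the three HFS stages (hypothetical helper)."""
    # Stage I: Welch t-test for continuous columns, chi-square test for categorical ones.
    stage1 = []
    for j in range(X.shape[1]):
        if j in categorical:
            levels = np.unique(X[:, j])
            table = np.array([[np.sum((X[:, j] == v) & (y == c)) for c in (0, 1)]
                              for v in levels])
            p = stats.chi2_contingency(table)[1]
        else:
            p = stats.ttest_ind(X[y == 0, j], X[y == 1, j], equal_var=False)[1]
        if p < alpha:
            stage1.append(j)

    # Stage II (assumed form): greedily drop a feature that is highly
    # correlated with a feature that has already been kept.
    corr = np.abs(np.corrcoef(X[:, stage1], rowvar=False))
    stage2 = []
    for pos, j in enumerate(stage1):
        if all(corr[pos, stage1.index(k)] < max_corr for k in stage2):
            stage2.append(j)

    # Stage III: keep the features whose LASSO coefficient is nonzero.
    Z = StandardScaler().fit_transform(X[:, stage2])
    lasso = Lasso(alpha=lasso_alpha).fit(Z, y)
    return [j for j, w in zip(stage2, lasso.coef_) if w != 0.0]


# Synthetic data, invented for the example only.
rng = np.random.default_rng(0)
n = 600
y = rng.integers(0, 2, n)
x0 = 2.0 * y + rng.normal(0.0, 1.0, n)               # informative, continuous
x1 = x0 + rng.normal(0.0, 0.05, n)                   # near-duplicate of x0
x2 = rng.normal(0.0, 1.0, n)                         # noise
x3 = (rng.random(n) < 0.3 + 0.4 * y).astype(float)   # informative, binary
X = np.column_stack([x0, x1, x2, x3])

selected = hybrid_feature_selection(X, y, categorical={3})

# Feed the selected features into the boosted-tree classifier.
X_train, X_test, y_train, y_test = train_test_split(
    X[:, selected], y, test_size=0.25, random_state=0, stratify=y)
model = Booster(n_estimators=50, max_depth=2, learning_rate=0.1)
model.fit(X_train, y_train)
accuracy = float(np.mean(model.predict(X_test) == y_test))

print("selected columns:", selected)
print("held-out accuracy:", round(accuracy, 3))
print("importances:", dict(zip(selected, model.feature_importances_)))
```

The per-feature importances printed at the end correspond to the interpretability point in the abstract: because only the features that survive all three stages reach the classifier, the importance ranking is computed over a small, non-redundant set.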
Pages: 1-22
Page count: 22