The effect of imputing missing clinical attribute values on training lung cancer survival prediction model performance

Cited by: 15
Authors
Barakat M.S. [1,2]
Field M. [1,2]
Ghose A. [3]
Stirling D. [4]
Holloway L. [1,2,5]
Vinod S. [1,5]
Dekker A. [6]
Thwaites D. [7]
Affiliations
[1] South Western Sydney Clinical School, UNSW, Liverpool, 2170, NSW
[2] Ingham Institute for Applied Medical Research, 1 Campbell St, Liverpool, 2170, NSW
[3] School of Computing and Information Technology, University of Wollongong, Northfield Ave, Wollongong, 2522, NSW
[4] School of Electrical, Computer and Telecommunication Engineering, University of Wollongong, Northfield Ave, Wollongong, 2522, NSW
[5] Liverpool & Macarthur Cancer Therapy Centre, Sydney, 2170, NSW
[6] Department of Radiation Oncology (MAASTRO), GROW School for Oncology and Developmental Biology, Maastricht University, Maastricht
[7] Institute of Medical Physics, School of Physics, University of Sydney, Sydney, 2006, NSW
Keywords
Decision Support; Imputation; Missing data; Modeling and Lung Cancer
DOI
10.1007/s13755-017-0039-4
Abstract
According to estimates from the World Health Organization and the International Agency for Research on Cancer, lung cancer is the most common cause of cancer death worldwide. Recent years have seen growing attention to clinical decision support systems in medicine generally and in cancer care in particular. Such systems can predict a patient's likelihood of survival by learning from previously treated patients. The datasets mined to develop clinical decision support functionality are often incomplete, which adversely affects the quality of the resulting models and of the decision support offered. Imputing missing data using statistical analysis is a common approach to the missing data problem. This work investigates the effect of imputation methods for missing data when preparing a training dataset for a Non-Small Cell Lung Cancer survival prediction model built with several machine learning algorithms. The investigation assesses the effect of imputation algorithm error on prediction performance and compares training on a smaller complete real dataset with training on a larger dataset containing imputed data. Our results show that even when the proportion of records with some missing data is very high (> 80%), imputation can lead to prediction models with an AUC (0.68–0.72) comparable to those trained with complete data records. © 2017, Springer International Publishing AG, part of Springer Nature.
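The trade-off described in the abstract, a smaller set of complete records versus a larger set with imputed values, can be illustrated with a minimal sketch. The abstract does not name the imputation methods or learning algorithms used, so this example assumes simple per-attribute mean imputation as a stand-in. The records, attribute names, and helper functions are hypothetical and are not taken from the paper.

```python
from statistics import mean

# Hypothetical toy records: (age, tumour_volume_cm3, survived_two_years).
# None marks a missing attribute value. These are NOT data from the paper.
records = [
    (62.0, 35.0, 1),
    (71.0, None, 0),
    (None, 80.0, 0),
    (58.0, 20.0, 1),
    (66.0, None, 1),
]


def complete_cases(rows):
    """Keep only rows with no missing attribute (the smaller, fully observed set)."""
    return [r for r in rows if None not in r]


def mean_impute(rows, n_features=2):
    """Fill each missing attribute with the mean of that attribute's observed values.

    Mean imputation is an assumption made for illustration only.
    """
    col_means = [
        mean(r[j] for r in rows if r[j] is not None) for j in range(n_features)
    ]
    return [
        tuple(col_means[j] if r[j] is None else r[j] for j in range(n_features))
        + r[n_features:]
        for r in rows
    ]


def auc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive case receives the higher score (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


small = complete_cases(records)  # complete records only
large = mean_impute(records)     # all records, gaps filled with column means
print(len(small), len(large))
```

A model would then be trained once on `small` and once on `large`, and the two would be compared by scoring the same held-out complete records with `auc`. In this toy set, complete-case filtering discards three of the five records, which is the loss of training data that imputation is meant to avoid.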