Missing data imputation and synthetic data simulation through modeling graphical probabilistic dependencies between variables (ModGraProDep): An application to breast cancer survival

Cited by: 9
Authors
Vilardell, Mireia [4]
Buxo, Maria [5, 6]
Cleries, Ramon [7, 8]
Martinez, Jose Miguel [9, 10, 11]
Garcia, Gemma [4]
Ameijide, Alberto [12]
Font, Rebeca [7]
Civit, Sergi [4]
Marcos-Gragera, Rafael [1, 2, 5, 6]
Vilardell, Maria Loreto [6]
Carulla, Maria [12]
Espinas, Josep Alfons [7]
Galceran, Jaume [12]
Izquierdo, Angel [3, 6]
Borras, Josep Ma [7, 8]
Affiliations
[1] Univ Girona UdG, Sch Med, Girona, Spain
[2] Ctr Invest Biomed Red Epidemiol & Salud Publ CIBE, Madrid, Spain
[3] Hosp Univ Girona Doctor Josep Trueta, Inst Catala Oncol, Serv Oncol Med, Girona 17005, Spain
[4] Univ Barcelona, Secc Estadist, Dept Genet Microbiol & Estadist, Fac Biol, Barcelona 08028, Spain
[5] IDIBGI, Inst Invest Biomed Girona, C Dr Castany S-N Edifici M2, Salt 17190, Spain
[6] Grup Epidemiol Descript Genet & Prevencio Canc Gi, Inst Catala Oncol, Registre Canc Girona Unitat Epidemiol, Pla Director Oncol, Girona 17005, Spain
[7] IDIBELL, Oncol, Ave Gran Via 199-203, Lhospitalet De Llobregat 08908, Spain
[8] Univ Barcelona, Dept Ciencies Clin, Barcelona 08907, Spain
[9] MC Mutual, Dept Anal & Planificac Recursos Sanitarios, Barcelona 08037, Spain
[10] Tech Univ Catalonia, Dept Stat, Barcelona 08028, Spain
[11] Univ Alicante, Publ Hlth Res Grp, Alicante 03690, Spain
[12] Hosp Univ St Joan Reus, Registre Canc Tarragona, Serv Epidemiol & Prevencio Canc, IISPV, Reus, Spain
Keywords
Breast cancer; Survival; Graphical models; Missing data; Oversampling; Simulation; COVARIATE DATA; SPAIN; STAGE; DISCRETE; SMOTE;
DOI
10.1016/j.artmed.2020.101875
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Background: Two common issues may arise in population-based breast cancer (BC) survival studies: (i) missing values in a predictor of survival, such as "Stage" at diagnosis, and (ii) small sample sizes in certain subsets of patients owing to the "class imbalance problem". Both call for data modeling/simulation methods. Methods: We present a procedure, ModGraProDep, based on graphical modeling (GM) of a dataset to overcome these two issues. The performance of the models derived from ModGraProDep is compared with a set of frequently used classification and machine learning algorithms (Missing Data Problem) and with oversampling algorithms (Synthetic Data Simulation). For the Missing Data Problem we assessed two scenarios: missing completely at random (MCAR) and missing not at random (MNAR). Two validated BC datasets provided by the cancer registries of Girona and Tarragona (northeastern Spain) were used. Results: In both the MCAR and MNAR scenarios, all models showed poorer prediction performance than three GM models: the saturated one (GM.SAT) and two with penalty factors on the partial likelihood (GM.K1 and GM.TEST). However, GM.SAT predictions could lead to unreliable conclusions in BC survival analysis. Simulating a "synthetic" dataset from GM.SAT could be the worst strategy, but using the remaining GM models could be better than oversampling. Conclusion: Our results suggest the use of the GM procedure presented here for one-variable imputation/prediction of missing data and for simulating "synthetic" BC survival datasets. The "synthetic" datasets derived from GMs could also be used in clinical applications of cancer survival data, such as predictive risk analysis.
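The difference between the MCAR and MNAR scenarios, and the idea of imputing one categorical variable from a saturated conditional table, can be illustrated with a small sketch. This is not the ModGraProDep implementation and does not reproduce the penalized models (GM.K1, GM.TEST): it assumes a toy dataset with made-up variables (`age`, `grade`, `stage`), made-up missingness rates, and a simple rule that imputes the most frequent observed stage for each covariate pattern. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

STAGES = ["I", "II", "III"]

def make_toy_data(n, rng):
    """Hypothetical toy records; not real registry data."""
    rows = []
    for _ in range(n):
        age = rng.choice(["<50", "50-69", "70+"])
        grade = rng.choice(["low", "high"])
        # Stage depends on grade, so the covariates carry information about it.
        weights = [0.6, 0.3, 0.1] if grade == "low" else [0.2, 0.3, 0.5]
        stage = rng.choices(STAGES, weights=weights)[0]
        rows.append({"age": age, "grade": grade, "stage": stage})
    return rows

def mask_mcar(rows, rate, rng):
    """MCAR: every record has the same chance of losing its stage."""
    return [dict(r, stage=None) if rng.random() < rate else dict(r) for r in rows]

def mask_mnar(rows, rate_by_stage, rng):
    """MNAR: the chance of losing stage depends on the stage value itself."""
    return [dict(r, stage=None) if rng.random() < rate_by_stage[r["stage"]] else dict(r)
            for r in rows]

def fit_saturated(rows, covariates, target):
    """Saturated model: one empirical table of target counts per covariate pattern."""
    tables, marginal = defaultdict(Counter), Counter()
    for r in rows:
        if r[target] is not None:
            tables[tuple(r[c] for c in covariates)][r[target]] += 1
            marginal[r[target]] += 1
    return tables, marginal

def impute(rows, covariates, target, tables, marginal):
    """Fill each missing target with the most frequent observed value for its pattern."""
    completed = []
    for r in rows:
        r = dict(r)
        if r[target] is None:
            # Fall back to the marginal counts if the pattern was never observed.
            counts = tables.get(tuple(r[c] for c in covariates)) or marginal
            r[target] = counts.most_common(1)[0][0]
        completed.append(r)
    return completed

def accuracy_on_missing(truth, masked, completed, target):
    """Share of masked records whose imputed value matches the true value."""
    idx = [i for i, r in enumerate(masked) if r[target] is None]
    return sum(completed[i][target] == truth[i][target] for i in idx) / len(idx)

rng = random.Random(0)
truth = make_toy_data(2000, rng)
scenarios = {
    "MCAR": mask_mcar(truth, 0.3, rng),
    "MNAR": mask_mnar(truth, {"I": 0.1, "II": 0.3, "III": 0.6}, rng),
}
for name, masked in scenarios.items():
    tables, marginal = fit_saturated(masked, ["age", "grade"], "stage")
    completed = impute(masked, ["age", "grade"], "stage", tables, marginal)
    print(name, round(accuracy_on_missing(truth, masked, completed, "stage"), 3))
```

In this toy setup the MNAR mask removes advanced stages more often, so the conditional tables fitted on the observed records are shifted toward earlier stages. That is one way to see why the two scenarios are assessed separately.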
Pages: 11
Related papers
5 items in total
  • [1] Impact of Imputation of Missing Data on Estimation of Survival Rates: An Example in Breast Cancer
    Baneshi, M. R.
    Talei, A. R.
    IRANIAN JOURNAL OF CANCER PREVENTION, 2010, 3 (03) : 127 - 131
  • [2] Missing data imputation on the 5-year survival prediction of breast cancer patients with unknown discrete values
    Garcia-Laencina, Pedro J.
    Abreu, Pedro Henriques
    Abreu, Miguel Henriques
    Afonoso, Noemia
    COMPUTERS IN BIOLOGY AND MEDICINE, 2015, 59 : 125 - 133
  • [3] Using routinely collected health data to investigate the association between ethnicity and breast cancer incidence and survival: what is the impact of missing data and multiple ethnicities?
    Downing, Amy
    West, Robert M.
    Gilthorpe, Mark S.
    Lawrence, Gill
    Forman, David
    ETHNICITY & HEALTH, 2011, 16 (03) : 201 - 212
  • [4] Matching methods to create paired survival data based on an exposure occurring over time: a simulation study with application to breast cancer
    Alexia Savignoni
    Caroline Giard
    Pascale Tubert-Bitter
    Yann De Rycke
    BMC Medical Research Methodology, 14
  • [5] Matching methods to create paired survival data based on an exposure occurring over time: a simulation study with application to breast cancer
    Savignoni, Alexia
    Giard, Caroline
    Tubert-Bitter, Pascale
    De Rycke, Yann
    BMC MEDICAL RESEARCH METHODOLOGY, 2014, 14