Comparison of statistical and machine learning models for healthcare cost data: a simulation study motivated by Oncology Care Model (OCM) data

Cited by: 30
Authors
Mazumdar, Madhu [1, 2]
Lin, Jung-Yi Joyce [1, 2]
Zhang, Wei [3]
Li, Lihua [1, 2]
Liu, Mark [2]
Dharmarajan, Kavita [4]
Sanderson, Mark [5]
Isola, Luis [2]
Hu, Liangyuan [1, 2]
Affiliations
[1] Icahn Sch Med Mt Sinai, Inst Healthcare Delivery Sci, Dept Populat Hlth Sci & Policy, New York, NY 10029 USA
[2] Mt Sinai Hosp, Tisch Canc Inst, New York, NY 10029 USA
[3] Univ Arkansas, Dept Math & Stat, Little Rock, AR 72204 USA
[4] Mt Sinai Hosp, Dept Radiat Oncol, Brookdale Dept Geriatr & Palliat Med, Icahn Sch Med Mt Sinai, New York, NY 10029 USA
[5] Icahn Sch Med Mt Sinai, Dept Hlth Syst Design & Global Hlth, New York, NY 10029 USA
Keywords
Oncology care model; Risk-adjustment model; Machine learning; Quantile regression; Generalized linear model; RISK-ADJUSTMENT; REGRESSION; OUTCOMES
DOI
10.1186/s12913-020-05148-y
Chinese Library Classification (CLC) number
R19 [Health care organizations and services (health services administration)]
Abstract
Background: The Oncology Care Model (OCM) was developed as a payment model to encourage participating practices to provide better-quality care for cancer patients at a lower cost. The risk-adjustment model used in OCM is a Gamma generalized linear model (Gamma GLM) with log-link. The predicted value of expense for the episodes identified for our academic medical center (AMC), based on the model fitted to the national data, did not correlate well with our observed expense. This motivated us to fit the Gamma GLM to our AMC data and compare it with two other flexible modeling methods: Random Forest (RF) and Partially Linear Additive Quantile Regression (PLAQR). We also performed a simulation study to assess comparative performance of these methods and examined the impact of non-linearity and interaction effects, two understudied aspects in the field of cost prediction.
Methods: The simulation was designed with an outcome of cost generated from four distributions: Gamma, Weibull, Log-normal with a heteroscedastic error term, and heavy-tailed. Simulation parameters both similar to and different from OCM data were considered. The performance metrics considered were the root mean square error (RMSE), mean absolute prediction error (MAPE), and cost accuracy (CA). Bootstrap resampling was utilized to estimate the operating characteristics of the performance metrics, which were described by boxplots.
Results: RF attained the best performance with lowest RMSE, MAPE, and highest CA for most of the scenarios. When the models were misspecified, their performance was further differentiated. Model performance differed more for non-exponential than exponential outcome distributions.
Conclusions: RF outperformed Gamma GLM and PLAQR in predicting overall and top decile costs. RF demonstrated improved prediction under various scenarios common in healthcare cost modeling. Additionally, RF did not require prespecification of outcome distribution, nonlinearity effect, or interaction terms. Therefore, RF appears to be the best tool to predict average cost. However, when the goal is to estimate extreme expenses, e.g., high cost episodes, the accuracy gained by RF versus its computational costs may need to be considered.
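Illustrative sketch (not from the paper): the evaluation idea in the Methods paragraph, two of the named metrics plus bootstrap resampling, can be shown in a few lines of standard-library Python. Everything concrete here is an assumption made for illustration: the single covariate, the coefficients `b0` and `b1`, the Gamma shape, the sample size, and the two stand-in predictors (an oracle log-link mean and a constant overall mean). None of the paper's actual models (Gamma GLM fit, RF, PLAQR) are fitted, and cost accuracy (CA) is omitted because the abstract does not define it. The bootstrap below resamples episodes with fixed predictions; it does not refit any model.

```python
import math
import random
import statistics


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted costs."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute prediction error (an absolute error, not a percentage)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def simulate_gamma_costs(n, rng, b0=9.0, b1=0.5, shape=2.0):
    """Draw costs from a Gamma distribution whose mean follows a log link.

    Hypothetical setup: mean = exp(b0 + b1 * x) with x uniform on [0, 1).
    """
    xs, ys = [], []
    for _ in range(n):
        x = rng.random()
        mu = math.exp(b0 + b1 * x)
        # gammavariate(shape, scale) has mean shape * scale, so scale = mu / shape.
        ys.append(rng.gammavariate(shape, mu / shape))
        xs.append(x)
    return xs, ys


def bootstrap_metric(metric, y_true, y_pred, rng, n_boot=200):
    """Resample episodes with replacement and recompute the metric each time."""
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return values


rng = random.Random(0)
xs, ys = simulate_gamma_costs(500, rng)

# Two stand-in predictors (assumptions, not the paper's models).
pred_link = [math.exp(9.0 + 0.5 * x) for x in xs]  # oracle log-link mean
pred_mean = [statistics.fmean(ys)] * len(ys)       # constant overall mean

boot_rmse_link = bootstrap_metric(rmse, ys, pred_link, rng)
boot_rmse_mean = bootstrap_metric(rmse, ys, pred_mean, rng)
boot_mape_link = bootstrap_metric(mape, ys, pred_link, rng)

for name, values in [
    ("RMSE, log-link mean", boot_rmse_link),
    ("RMSE, overall mean", boot_rmse_mean),
    ("MAPE, log-link mean", boot_mape_link),
]:
    # Median and range are a text substitute for the boxplots the abstract mentions.
    print(f"{name}: median={statistics.median(values):.1f}, "
          f"min={min(values):.1f}, max={max(values):.1f}")
```

Swapping the outcome generator for a Weibull or log-normal draw (for example `rng.weibullvariate` or `rng.lognormvariate`) would mimic the other distributional scenarios the abstract lists, again only as a rough illustration.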
Pages: 12