Postprediction Inference for Clinical Characteristics Extracted With Machine Learning on Electronic Health Records

Citations: 0
Authors
Sondhi, Arjun [1]
Rich, Alexander S. [1]
Wang, Siruo [2]
Leek, Jeffrey T. [2]
Affiliations
[1] Flatiron Health Inc, New York, NY, USA
[2] Johns Hopkins Bloomberg School of Public Health, Department of Biostatistics, Baltimore, MD, USA
DOI
10.1200/CCI.22.00174
Chinese Library Classification number
R73 [Oncology];
Discipline classification code
100214;
Abstract
PURPOSE Real-world data (RWD) derived from electronic health records (EHRs) are often used to understand population-level relationships between patient characteristics and cancer outcomes. Machine learning (ML) methods enable researchers to extract characteristics from unstructured clinical notes, and represent a more cost-effective and scalable approach than manual expert abstraction. These extracted data are then used in epidemiologic or statistical models as if they were abstracted observations. Analytical results derived from extracted data in this way may differ from those given by abstracted data, and the magnitude of this difference is not directly informed by standard ML performance metrics.
METHODS We define the task of postprediction inference: recovering, from an ML-extracted variable, estimation and inference similar to what would be obtained by abstracting the variable. We consider fitting a Cox proportional hazards model that uses a binary ML-extracted variable as a covariate and evaluate four approaches for postprediction inference in this setting. The first two approaches require only the ML-predicted probability, while the latter two additionally require a labeled (human-abstracted) validation data set.
RESULTS Our results for both simulated data and EHR-derived RWD from a national cohort demonstrate that we can improve inference from ML-extracted variables by leveraging a limited amount of labeled data.
CONCLUSION We describe and evaluate methods for fitting statistical models using ML-extracted variables subject to model error. We show that estimation and inference are generally valid when using extracted data from high-performing ML models. More complex methods that incorporate auxiliary labeled data provide further improvements.
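The abstract's point that results from an ML-extracted covariate can differ from those based on the abstracted covariate can be illustrated with a small simulation. The sketch below is not one of the four approaches evaluated in the paper, and it swaps the Cox model for a simpler exponential hazard model without censoring so that it runs with the Python standard library alone. The sensitivity, specificity, sample size, and true log hazard ratio are hypothetical values chosen only for illustration.

```python
import math
import random


def exp_log_hazard_ratio(times, groups):
    """Log hazard ratio (group 1 vs. group 0) under an exponential model.

    With no censoring, the maximum likelihood estimate of each group's
    hazard is (number of events) / (total follow-up time).
    """
    n1 = sum(groups)
    n0 = len(groups) - n1
    t1 = sum(t for t, g in zip(times, groups) if g == 1)
    t0 = sum(t for t, g in zip(times, groups) if g == 0)
    return math.log((n1 / t1) / (n0 / t0))


def simulate(n=20000, true_log_hr=0.7, sensitivity=0.85,
             specificity=0.90, seed=0):
    """Compare estimates from the true covariate and a noisy extracted one.

    All parameter values are hypothetical and are not taken from the paper.
    """
    rng = random.Random(seed)
    truth, extracted, times = [], [], []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0  # abstracted ("true") value
        if x == 1:
            # Extracted label is correct with probability = sensitivity.
            x_ml = 1 if rng.random() < sensitivity else 0
        else:
            # Extracted label is correct with probability = specificity.
            x_ml = 0 if rng.random() < specificity else 1
        # Event time with hazard exp(true_log_hr * x); baseline hazard is 1.
        t = rng.expovariate(math.exp(true_log_hr * x))
        truth.append(x)
        extracted.append(x_ml)
        times.append(t)
    return (exp_log_hazard_ratio(times, truth),
            exp_log_hazard_ratio(times, extracted))


if __name__ == "__main__":
    est_true, est_ml = simulate()
    print(f"log HR using abstracted covariate:   {est_true:.3f}")
    print(f"log HR using ML-extracted covariate: {est_ml:.3f}")
```

In this simplified setting, the estimate based on the error-prone extracted covariate is pulled toward zero relative to the estimate based on the true covariate. A real analysis would fit a Cox model with censoring, for which a survival analysis library would be needed.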
Pages: 10