The Transformative Potential of Large Language Models in Mining Electronic Health Records Data: Content Analysis

Cited: 0
Authors
Zurita, Amadeo Jesus Wals [1 ]
del Rio, Hector Miras [1 ]
de Aguirre, Nerea Ugarte Ruiz [1 ]
Navarro, Cristina Nebrera [1 ]
Jimenez, Maria Rubio [1 ]
Carmona, David Munoz [1 ]
Sanchez, Carlos Miguez [1 ]
Affiliations
[1] Andalusian Hlth Serv, Hosp Univ Virgen Macarena, Serv Oncol Radioterap, Ave Dr Fedriani S-N, Seville 41009, Spain
Keywords
electronic health record; EHR; oncology; radiotherapy; data mining; ChatGPT; large language models; LLMs
DOI
10.2196/58457
Chinese Library Classification
R-058
Abstract
Background: In this study, we evaluate the accuracy, efficiency, and cost-effectiveness of large language models (LLMs) in extracting and structuring information from free-text clinical reports, particularly in identifying and classifying patient comorbidities within oncology electronic health records.

Objective: We specifically compare the performance of the gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators.

Methods: We implemented a script using the OpenAI application programming interface (API) to extract structured information in JavaScript Object Notation (JSON) format from comorbidities reported in 250 personal history reports. These reports were manually reviewed in batches of 50 by 5 specialists in radiation oncology. We compared the results using metrics such as sensitivity, specificity, precision, accuracy, F-value, kappa index, and the McNemar test, and examined the common causes of errors made by both the humans and the generative pretrained transformer (GPT) models.

Results: The GPT-3.5 model performed slightly worse than the physicians across all metrics, though the differences were not statistically significant (McNemar test, P=.79). GPT-4 demonstrated clear superiority in several key metrics (McNemar test, P<.001). Notably, it achieved a sensitivity of 96.8%, compared to 88.2% for GPT-3.5 and 88.8% for the physicians. However, the physicians marginally outperformed GPT-4 in precision (97.7% vs 96.8%). GPT-4 was also more consistent, replicating the exact same results in 76% of the reports across 10 repeated analyses, compared to 59% for GPT-3.5, indicating more stable and reliable performance. The physicians were more likely to miss explicit comorbidities, while the GPT models more frequently inferred nonexplicit comorbidities, sometimes correctly, though this also produced more false positives.

Conclusions: This study demonstrates that, with well-designed prompts, the LLMs examined can match or even surpass medical specialists in extracting information from complex clinical reports. Their superior efficiency in time and cost, together with their easy integration with databases, makes them a valuable tool for large-scale data mining and real-world evidence generation.
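The Methods combine two mechanical steps: a structured-output request to the OpenAI chat API and a per-comorbidity confusion-matrix comparison against the specialists' annotations. A minimal sketch of both, assuming an illustrative prompt and model name (the actual prompt, schema, and code used in the study are not reproduced here):

```python
def build_extraction_request(report_text: str) -> dict:
    """Hypothetical chat-completion payload for extracting comorbidities as JSON.

    The model name and prompt wording are illustrative assumptions,
    not the study's actual prompt.
    """
    return {
        "model": "gpt-4-1106-preview",
        "response_format": {"type": "json_object"},  # forces a JSON reply
        "messages": [
            {
                "role": "system",
                "content": (
                    "Extract the patient's comorbidities from the clinical "
                    "report and return a JSON object with a 'comorbidities' "
                    "list of standardized condition names."
                ),
            },
            {"role": "user", "content": report_text},
        ],
    }


def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Agreement metrics of the kind used to compare GPT output and physicians."""
    sensitivity = tp / (tp + fn)  # recall: share of true comorbidities found
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)    # share of reported comorbidities that are real
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_value = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_value": f_value,
    }
```

The payload could be sent with the OpenAI Python client as `client.chat.completions.create(**build_extraction_request(text))`, and the parsed JSON then scored against the reference annotations with `confusion_metrics`; the kappa index and McNemar test would require the paired per-report decisions rather than aggregate counts.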
Pages: 15
References
24 items in total
  • [1] Achiam J., 2023, ARXIV, DOI 10.48550/ARXIV.2303.08774
  • [2] Approach to machine learning for extraction of real-world data variables from electronic health records
    Adamson, Blythe
    Waskom, Michael
    Blarre, Auriane
    Kelly, Jonathan
    Krismer, Konstantin
    Nemeth, Sheila
    Gippetti, James
    Ritten, John
    Harrison, Katherine
    Ho, George
    Linzmayer, Robin
    Bansal, Tarun
    Wilkinson, Samuel
    Amster, Guy
    Estola, Evan
    Benedum, Corey M.
    Fidyk, Erin
    Estevez, Melissa
    Shapiro, Will
    Cohen, Aaron B.
    [J]. FRONTIERS IN PHARMACOLOGY, 2023, 14
  • [3] [Anonymous], Khronos Group. URL https://www.khronos.org/. Last accessed: February 27, 2024.
  • [4] Organic generation of real-world real-time data for clinical evidence in radiation oncology
    Bertolet, A.
    Wals, A.
    Miras, H.
    Macias, J.
    [J]. INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, 2020, 144
  • [5] Developing prompts from large language model for extracting clinical information from pathology and ultrasound reports in breast cancer
    Choi, Hyeon Seok
    Song, Jun Yeong
    Shin, Kyung Hwan
    Chang, Ji Hyun
    Jang, Bum-Sup
    [J]. RADIATION ONCOLOGY JOURNAL, 2023, 41 (03): : 209 - 216
  • [6] Potential of ChatGPT and GPT-4 for Data Mining of Free-Text CT Reports on Lung Cancer
    Fink, Matthias A.
    Bischoff, Arved
    Fink, Christoph A.
    Moll, Martin
    Kroschke, Jonas
    Dulz, Luca
    Heussel, Claus Peter
    Kauczor, Hans-Ulrich
    Weber, Tim F.
    [J]. RADIOLOGY, 2023, 308 (03)
  • [7] Hendrycks D, 2021, ARXIV, DOI 10.48550/ARXIV.2009.03300
  • [8] ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis
    Hoppe, John Michael
    Auer, Matthias K.
    Strueven, Anna
    Massberg, Steffen
    Stremmel, Christopher
    [J]. JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [9] Jin YQ, 2023, ARXIV, arXiv:2310.13132
  • [10] From real-world electronic health record data to real-world results using artificial intelligence
    Knevel, Rachel
    Liao, Katherine P.
    [J]. ANNALS OF THE RHEUMATIC DISEASES, 2023, 82 (03) : 306 - 311