Assessing Completeness of Clinical Histories Accompanying Imaging Orders Using Adapted Open-Source and Closed-Source Large Language Models

Cited: 0
Authors
Larson, David B. [1 ,2 ]
Koirala, Arogya [2 ]
Cheuy, Lina Y. [1 ,2 ]
Paschali, Magdalini [1 ,2 ]
Van Veen, Dave [3 ]
Na, Hye Sun [1 ,2 ]
Petterson, Matthew B. [1 ]
Fang, Zhongnan [1 ,2 ]
Chaudhari, Akshay S. [1 ,2 ,4 ]
Affiliations
[1] Stanford Univ, Dept Radiol, Sch Med, 453 Quarry Rd,MC 5659, Stanford, CA 94304 USA
[2] Stanford Univ, AI Dev & Evaluat Lab, Sch Med, Palo Alto, CA 94305 USA
[3] Stanford Univ, Dept Elect Engn, Stanford, CA USA
[4] Stanford Univ, Dept Biomed Data Sci, Stanford, CA USA
DOI
10.1148/radiol.241051
Chinese Library Classification
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Subject Classification Codes
1002; 100207; 1009
Abstract
Background: Incomplete clinical histories are a well-known problem in radiology. Previous dedicated quality improvement efforts focusing on reproducible assessments of the completeness of free-text clinical histories have relied on tedious manual analysis.

Purpose: To adapt and evaluate open-source and closed-source large language models (LLMs) for their ability to automatically extract clinical history elements within imaging orders and to use the best-performing adapted open-source model to assess the completeness of a large sample of clinical histories as a benchmark for clinical practice.

Materials and Methods: This retrospective single-site study used previously extracted information accompanying CT, MRI, US, and radiography orders from August 2020 to May 2022 at an adult and pediatric emergency department of a 613-bed tertiary academic medical center. Two open-source (Llama 2-7B [Meta], Mistral-7B [Mistral AI]) and one closed-source (GPT-4 Turbo [OpenAI]) LLMs were adapted using prompt engineering, in-context learning, and fine-tuning (open-source only) to extract the elements "past medical history," "what," "when," "where," and "clinical concern" from clinical histories. Model performance, interreader agreement using Cohen kappa (none to slight, 0.01-0.20; fair, 0.21-0.40; moderate, 0.41-0.60; substantial, 0.61-0.80; almost perfect, 0.81-1.00), and semantic similarity between the models and the adjudicated manual annotations of two board-certified radiologists with 16 and 3 years of postfellowship experience, respectively, were assessed using accuracy, Cohen kappa, and BERTScore, an LLM-based metric that quantifies how well two pieces of text convey the same meaning; 95% CIs were also calculated. The best-performing open-source model was then used to assess completeness on a large dataset of unannotated clinical histories.

Results: A total of 50 186 clinical histories were included (794 training, 150 validation, 300 initial testing, 48 942 real-world application). Of the two open-source models, Mistral-7B outperformed Llama 2-7B in assessing completeness and was further fine-tuned. Both Mistral-7B and GPT-4 Turbo showed substantial overall agreement with radiologists (mean kappa, 0.73 [95% CI: 0.67, 0.78] to 0.77 [95% CI: 0.71, 0.82]) and adjudicated annotations (mean BERTScore, 0.96 [95% CI: 0.96, 0.97] for both models; P = .38). Mistral-7B also rivaled GPT-4 Turbo in performance (weighted overall mean accuracy, 91% [95% CI: 89, 93] vs 92% [95% CI: 90, 94]; P = .31) despite being a smaller model. Using Mistral-7B, 26.2% (12 803 of 48 942) of unannotated clinical histories were found to contain all five elements.

Conclusion: An easily deployable fine-tuned open-source LLM (Mistral-7B), rivaling GPT-4 Turbo in performance, could effectively extract clinical history elements with substantial agreement with radiologists and produce a benchmark for completeness of a large sample of clinical histories. The model and code will be fully open-sourced.
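The agreement and completeness checks described in the abstract can be sketched in a few lines. This is a minimal illustration with made-up labels, not the paper's released code: the element names follow the abstract, but the rater labels and the `cohen_kappa`/`is_complete` helpers are illustrative assumptions.

```python
# Minimal sketch of per-element agreement (Cohen kappa) and the
# "all five elements present" completeness check from the abstract.
# Labels below are invented for illustration only.

ELEMENTS = ["past medical history", "what", "when", "where", "clinical concern"]

def cohen_kappa(a, b):
    """Cohen kappa for two binary raters (1 = element present)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    p_yes = (sum(a) / n) * (sum(b) / n)                   # chance both say 1
    p_no = (1 - sum(a) / n) * (1 - sum(b) / n)            # chance both say 0
    pe = p_yes + p_no                                     # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

def is_complete(extracted):
    """A history counts as complete when all five elements were extracted."""
    return all(extracted.get(e) for e in ELEMENTS)

# Illustrative presence labels for one element across six histories:
model = [1, 1, 0, 1, 0, 1]
radiologist = [1, 1, 0, 0, 0, 1]
print(round(cohen_kappa(model, radiologist), 2))  # → 0.67 ("substantial" band)

# Completeness rate reported in the paper: 12 803 of 48 942 histories ≈ 26.2%
rate = 12803 / 48942
```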
Pages: 11