Assessing Completeness of Clinical Histories Accompanying Imaging Orders Using Adapted Open-Source and Closed-Source Large Language Models

Cited: 0
Authors
Larson, David B. [1,2]
Koirala, Arogya [2]
Cheuy, Lina Y. [1,2]
Paschali, Magdalini [1,2]
Van Veen, Dave [3]
Na, Hye Sun [1,2]
Petterson, Matthew B. [1]
Fang, Zhongnan [1,2]
Chaudhari, Akshay S. [1,2,4]
Affiliations
[1] Stanford Univ, Dept Radiol, Sch Med, 453 Quarry Rd, MC 5659, Stanford, CA 94304 USA
[2] Stanford Univ, AI Dev & Evaluat Lab, Sch Med, Palo Alto, CA 94305 USA
[3] Stanford Univ, Dept Elect Engn, Stanford, CA USA
[4] Stanford Univ, Dept Biomed Data Sci, Stanford, CA USA
DOI
10.1148/radiol.241051
Chinese Library Classification (CLC)
R8 (Special Medicine); R445 (Diagnostic Imaging)
Subject Classification
1002; 100207; 1009
Abstract
Background: Incomplete clinical histories are a well-known problem in radiology. Previous dedicated quality improvement efforts focusing on reproducible assessments of the completeness of free-text clinical histories have relied on tedious manual analysis.

Purpose: To adapt and evaluate open-source and closed-source large language models (LLMs) for their ability to automatically extract clinical history elements within imaging orders, and to use the best-performing adapted open-source model to assess the completeness of a large sample of clinical histories as a benchmark for clinical practice.

Materials and Methods: This retrospective single-site study used previously extracted information accompanying CT, MRI, US, and radiography orders from August 2020 to May 2022 at the adult and pediatric emergency department of a 613-bed tertiary academic medical center. Two open-source LLMs (Llama 2-7B [Meta], Mistral-7B [Mistral AI]) and one closed-source LLM (GPT-4 Turbo [OpenAI]) were adapted using prompt engineering, in-context learning, and fine-tuning (open-source only) to extract the elements "past medical history," "what," "when," "where," and "clinical concern" from clinical histories. Model performance, interreader agreement, and semantic similarity between the models and the adjudicated manual annotations of two board-certified radiologists (16 and 3 years of postfellowship experience, respectively) were assessed using accuracy, Cohen kappa (none to slight, 0.01-0.20; fair, 0.21-0.40; moderate, 0.41-0.60; substantial, 0.61-0.80; almost perfect, 0.81-1.00), and BERTScore, an LLM-based metric that quantifies how well two pieces of text convey the same meaning; 95% CIs were also calculated. The best-performing open-source model was then used to assess completeness on a large dataset of unannotated clinical histories.

Results: A total of 50 186 clinical histories were included (794 training, 150 validation, 300 initial testing, 48 942 real-world application). Of the two open-source models, Mistral-7B outperformed Llama 2-7B in assessing completeness and was further fine-tuned. Both Mistral-7B and GPT-4 Turbo showed substantial overall agreement with radiologists (mean kappa, 0.73 [95% CI: 0.67, 0.78] to 0.77 [95% CI: 0.71, 0.82]) and with the adjudicated annotations (mean BERTScore, 0.96 [95% CI: 0.96, 0.97] for both models; P = .38). Mistral-7B also rivaled GPT-4 Turbo in performance (weighted overall mean accuracy, 91% [95% CI: 89, 93] vs 92% [95% CI: 90, 94]; P = .31) despite being a smaller model. Using Mistral-7B, 26.2% (12 803 of 48 942) of unannotated clinical histories were found to contain all five elements.

Conclusion: An easily deployable fine-tuned open-source LLM (Mistral-7B), rivaling GPT-4 Turbo in performance, effectively extracted clinical history elements with substantial agreement with radiologists and produced a benchmark for the completeness of a large sample of clinical histories. The model and code will be fully open-sourced.
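The extraction step described above lends itself to a short illustration. Below is a minimal sketch of element extraction with an instruction-tuned Mistral-7B served through Hugging Face transformers; the checkpoint name, prompt wording, and JSON output schema are illustrative assumptions, not the authors' released model or code (which the abstract states will be open-sourced).

```python
# Minimal sketch: extracting the five clinical-history elements with an
# open-source LLM. Checkpoint, prompt, and schema are assumptions.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # stand-in; not the paper's fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

ELEMENTS = ["past medical history", "what", "when", "where", "clinical concern"]

PROMPT_TEMPLATE = (
    "Extract the following elements from the clinical history below. "
    "Return a JSON object with one key per element; use null when an "
    "element is absent.\n"
    "Elements: " + ", ".join(ELEMENTS) + "\n"
    "Clinical history: {history}\n"
    "JSON:"
)

def extract_elements(history: str) -> dict:
    """Run one clinical history through the model and parse its JSON answer."""
    prompt = PROMPT_TEMPLATE.format(history=history)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return json.loads(completion)  # assumes the adapted model emits valid JSON
```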
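The two reported evaluation measures can likewise be sketched: Cohen kappa on per-element presence labels (via scikit-learn) and BERTScore on extracted spans versus adjudicated annotations (via the bert-score package). All inputs below are toy examples, not study data.

```python
# Sketch of the evaluation metrics named in the abstract; toy inputs only.
from sklearn.metrics import cohen_kappa_score
from bert_score import score as bert_score

# Per-element presence labels (1 = present, 0 = absent), one per test case.
model_calls  = [1, 0, 1, 1, 0]   # e.g., the model's "clinical concern" decisions
reader_calls = [1, 0, 1, 0, 0]   # a radiologist's decisions on the same cases
kappa = cohen_kappa_score(model_calls, reader_calls)
print(f"kappa = {kappa:.2f}")    # 0.61-0.80 reads as substantial on the cited scale

# Semantic similarity of an extracted span against an adjudicated annotation.
candidates = ["chest pain radiating to the left arm"]
references = ["left-sided chest pain with radiation to the arm"]
P, R, F1 = bert_score(candidates, references, lang="en")
print(f"BERTScore F1 = {F1.mean():.2f}")
```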
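Finally, the real-world completeness benchmark reduces to tallying which histories yield all five elements. A standalone sketch, with toy extraction dicts standing in for model output:

```python
# Sketch of the completeness tally over unannotated histories.
ELEMENTS = ["past medical history", "what", "when", "where", "clinical concern"]

def is_complete(extracted: dict) -> bool:
    # An element counts only when a non-null, non-empty value was returned.
    return all(extracted.get(e) for e in ELEMENTS)

# Toy stand-ins for per-history extraction output (dicts keyed by element).
extractions = [
    {"past medical history": "HTN", "what": "fall from ladder", "when": "today",
     "where": "left hip", "clinical concern": "fracture"},
    {"past medical history": None, "what": "abdominal pain", "when": "2 days",
     "where": "RLQ", "clinical concern": "appendicitis"},
]
share = sum(map(is_complete, extractions)) / len(extractions)
print(f"{share:.1%} of histories contain all five elements")  # paper reports 26.2%
```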
Pages: 11