Comparison of Diagnostic and Triage Accuracy of Ada Health and WebMD Symptom Checkers, ChatGPT, and Physicians for Patients in an Emergency Department: Clinical Data Analysis Study

Cited: 34
Authors
Fraser, Hamish [1 ,2 ,7 ]
Crossland, Daven [1 ,3 ]
Bacher, Ian [1 ]
Ranney, Megan [4 ]
Madsen, Tracy [3 ,5 ]
Hilliard, Ross [6 ]
Affiliations
[1] Brown Univ, Brown Ctr Biomed Informat, Warren Alpert Med Sch, Providence, RI USA
[2] Brown Univ, Dept Hlth Serv Policy & Practice, Sch Publ Hlth, Providence, RI USA
[3] Brown Univ, Sch Publ Hlth, Dept Epidemiol, Providence, RI USA
[4] Yale Univ, Sch Publ Hlth, New Haven, CT USA
[5] Brown Univ, Warren Alpert Med Sch, Dept Emergency Med, Providence, RI USA
[6] Maine Med Ctr, Dept Internal Med, Portland, ME USA
[7] Brown Univ, Brown Ctr Biomed Informat, Warren Alpert Med Sch, 233 Richmond St, Providence, RI 02912 USA
Source
JMIR MHEALTH AND UHEALTH | 2023, Vol. 11
Funding
US Agency for Healthcare Research and Quality (AHRQ)
Keywords
diagnosis; triage; symptom checker; emergency patient; ChatGPT; LLM; diagnose; self-diagnose; self-diagnosis; app; application; language model; accuracy; ChatGPT-3.5; ChatGPT-4.0; emergency; machine learning;
DOI
10.2196/49995
Chinese Library Classification
R19 [Health organization and services (health services management)]
Abstract
Background: Diagnosis is a core component of effective health care, but misdiagnosis is common and can put patients at risk. Diagnostic decision support systems can help improve diagnosis by physicians and other health care workers. Symptom checkers (SCs) have been designed to improve diagnosis and triage (ie, which level of care to seek) by patients.

Objective: The aim of this study was to evaluate the performance of the new large language model ChatGPT (versions 3.5 and 4.0), the widely used WebMD SC, and an SC developed by Ada Health in the diagnosis and triage of patients with urgent or emergent clinical problems, compared with the final emergency department (ED) diagnoses and physician reviews.

Methods: We used previously collected, deidentified, self-report data from 40 patients presenting to an ED for care who used the Ada SC to record their symptoms prior to seeing the ED physician. Deidentified data were entered into ChatGPT versions 3.5 and 4.0 and WebMD by a research assistant blinded to diagnoses and triage. Diagnoses from all 4 systems were compared with the previously abstracted final ED diagnoses, as well as with diagnoses and triage recommendations from 3 independent board-certified ED physicians who had reviewed the self-report clinical data from Ada while blinded to the ED diagnoses. Diagnostic accuracy was calculated as the proportion of diagnoses from ChatGPT, the Ada SC, the WebMD SC, and the independent physicians that matched at least one ED diagnosis (stratified as top 1 or top 3). For triage accuracy, each recommendation from ChatGPT, WebMD, or Ada was classified as agreeing with at least 2 of the independent physicians or as "unsafe" or "too cautious."

Results: Overall, 30 and 37 cases had sufficient data for the diagnostic and triage analyses, respectively.
The rate of top-1 diagnosis matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 9 (30%), 12 (40%), 10 (33%), and 12 (40%), respectively, with a mean rate of 47% for the physicians. The rate of top-3 diagnosis matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 19 (63%), 19 (63%), 15 (50%), and 17 (57%), respectively, with a mean rate of 69% for the physicians. The distribution of triage results was as follows: Ada, 62% (n=23) agree, 14% (n=5) unsafe, and 24% (n=9) too cautious; ChatGPT 3.5, 59% (n=22) agree, 41% (n=15) unsafe, and 0% (n=0) too cautious; ChatGPT 4.0, 76% (n=28) agree, 22% (n=8) unsafe, and 3% (n=1) too cautious; and WebMD, 70% (n=26) agree, 19% (n=7) unsafe, and 11% (n=4) too cautious. The unsafe triage rate for ChatGPT 3.5 (41%) was significantly higher than that for Ada (14%; P=.009).

Conclusions: ChatGPT 3.5 had high diagnostic accuracy but a high unsafe triage rate. ChatGPT 4.0 had the poorest diagnostic accuracy but a lower unsafe triage rate and the highest triage agreement with the physicians. The Ada and WebMD SCs performed better overall than ChatGPT. Unsupervised patient use of ChatGPT for diagnosis and triage is not recommended without improvements to triage accuracy and extensive clinical evaluation.
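The top-1/top-3 match-rate calculation described in the Methods can be sketched as follows. This is a minimal illustration with hypothetical case data: in the study, matches between free-text diagnoses were judged clinically, so the exact set-membership check used here is only a stand-in for that judgment.

```python
# Sketch of the top-k diagnostic match rate from the Methods.
# All case data below are hypothetical; the study judged matches
# clinically, not by exact string comparison.

def top_k_match_rate(ranked_diagnoses, ed_diagnoses, k):
    """Proportion of cases in which any of the first k ranked diagnoses
    matches at least one final ED diagnosis."""
    hits = sum(
        1
        for ranked, final in zip(ranked_diagnoses, ed_diagnoses)
        if any(dx in final for dx in ranked[:k])
    )
    return hits / len(ranked_diagnoses)

# Three hypothetical cases: a ranked differential per case (from an SC
# or LLM) and the set of final ED diagnoses for the same case.
ranked = [
    ["appendicitis", "gastroenteritis", "ovarian cyst"],
    ["migraine", "tension headache", "sinusitis"],
    ["angina", "GERD", "costochondritis"],
]
final = [
    {"appendicitis"},
    {"subarachnoid hemorrhage"},
    {"GERD"},
]

print(top_k_match_rate(ranked, final, k=1))  # top-1: only case 1 matches
print(top_k_match_rate(ranked, final, k=3))  # top-3: cases 1 and 3 match
```

Stratifying by k in this way is what separates the top-1 rates (30%-40% for the tools) from the more forgiving top-3 rates (50%-63%) reported above.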
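The significance comparison in the Results (unsafe triage: ChatGPT 3.5, 15/37 vs Ada, 5/37; P=.009) can be checked against the underlying 2x2 table. The abstract does not name the test used, so a Fisher exact test is only one plausible reconstruction; it is implemented here directly from the hypergeometric distribution to keep the sketch dependency-free.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of all tables no more likely
    than the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def p(x):  # probability of a table with top-left cell equal to x
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p(a)
    lo = max(0, row1 + col1 - n)  # smallest feasible top-left cell
    hi = min(row1, col1)          # largest feasible top-left cell
    return sum(p(x) for x in range(lo, hi + 1) if p(x) <= p_obs + 1e-12)

# Unsafe vs not-unsafe triage recommendations (n=37 per system),
# using the counts reported in the Results:
# ChatGPT 3.5: 15 unsafe, 22 not; Ada: 5 unsafe, 32 not.
p_value = fisher_exact_two_sided(15, 22, 5, 32)
print(round(p_value, 3))
```

With these counts the two-sided value falls well below .05, consistent with the reported significance; the exact figure depends on which test the authors actually applied.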
Pages: 10
Related Papers (4)
  • [1] Triage accuracy of online symptom checkers for Accident and Emergency Department patients
    Yu, Stephanie Wing Yin
    Ma, Andre
    Tsang, Vivian Hiu Man
    Chung, Lulu Suet Wing
    Leung, Siu-Chung
    Leung, Ling-Pong
    HONG KONG JOURNAL OF EMERGENCY MEDICINE, 2020, 27 (04) : 217 - 222
  • [2] ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis
    Hoppe, John Michael
    Auer, Matthias K.
    Strueven, Anna
    Massberg, Steffen
    Stremmel, Christopher
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [3] Evaluation of Diagnostic and Triage Accuracy and Usability of a Symptom Checker in an Emergency Department: Observational Study
    Fraser, Hamish S. F.
    Cohan, Gregory
    Koehler, Christopher
    Anderson, Jared
    Lawrence, Alexis
    Patena, John
    Bacher, Ian
    Ranney, Megan L.
    JMIR MHEALTH AND UHEALTH, 2022, 10 (09)
  • [4] Comparison of Two Symptom Checkers (Ada and Symptoma) in the Emergency Department: Randomized, Crossover, Head-to-Head, Double-Blinded Study
    Knitza, Johannes
    Hasanaj, Ragip
    Beyer, Jonathan
    Ganzer, Franziska
    Slagman, Anna
    Bolanaki, Myrto
    Napierala, Hendrik
    Schmieding, Malte L.
    Al-Zaher, Nizam
    Orlemann, Till
    Muehlensiepen, Felix
    Greenfield, Julia
    Vuillerme, Nicolas
    Kuhn, Sebastian
    Schett, Georg
    Achenbach, Stephan
    Dechant, Katharina
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26