ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis

Cited: 8
Authors
Hoppe, John Michael [1 ]
Auer, Matthias K. [1 ]
Strueven, Anna [2 ,3 ]
Massberg, Steffen [2 ,3 ]
Stremmel, Christopher [2 ,3 ]
Affiliations
[1] LMU Univ Hosp, Dept Med 4, Munich, Germany
[2] LMU Univ Hosp, Dept Med 1, Marchioninistr 15, D-81377 Munich, Germany
[3] LMU Univ Hosp, German Ctr Cardiovasc Res, Munich Heart Alliance Partner Site, Munich, Germany
Keywords
emergency department; diagnosis; accuracy; artificial intelligence; ChatGPT; internal medicine; AI; natural language processing; NLP; emergency medicine triage; triage; physicians; physician; diagnostic accuracy; OpenAI
DOI
10.2196/56110
Chinese Library Classification
R19 [Health care organization and services (health administration)];
Discipline Classification
Abstract
Background: OpenAI's ChatGPT is a pioneering artificial intelligence (AI) in the field of natural language processing, and it holds significant potential in medicine for providing treatment advice. Recent studies have also demonstrated promising results using ChatGPT for emergency medicine triage. However, its diagnostic accuracy in the emergency department (ED) has not yet been evaluated.
Objective: This study compares the diagnostic accuracy of ChatGPT with GPT-3.5 and GPT-4 against that of the primary treating resident physicians in an ED setting.
Methods: For 100 adults admitted to our ED in January 2023 with internal medicine complaints, diagnostic accuracy was assessed by comparing the diagnoses made by ED resident physicians and those generated by ChatGPT with GPT-3.5 or GPT-4 against the final hospital discharge diagnosis, using a point system to grade accuracy.
Results: The study enrolled 100 patients with a median age of 72 (IQR 58.5-82.0) years who were admitted to our internal medicine ED primarily for cardiovascular, endocrine, gastrointestinal, or infectious diseases. GPT-4 outperformed both GPT-3.5 (P<.001) and ED resident physicians (P=.01) in diagnostic accuracy for internal medicine emergencies. Across disease subgroups, GPT-4 consistently outperformed GPT-3.5 and resident physicians, with significant superiority in cardiovascular (GPT-4 vs ED physicians: P=.03) and endocrine or gastrointestinal diseases (GPT-4 vs GPT-3.5: P=.01); in the other categories, the differences were not statistically significant.
Conclusions: In this study, which compared the diagnostic accuracy of GPT-3.5, GPT-4, and ED resident physicians against a discharge-diagnosis gold standard, GPT-4 outperformed both the resident physicians and its predecessor, GPT-3.5. Despite the retrospective design and limited sample size, the results underscore the potential of AI as a supportive diagnostic tool in ED settings.
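The comparison described in the abstract (point-graded diagnoses checked against the discharge diagnosis, then tested for significance) can be sketched as follows. The scores, the 0-2 point scale, and the choice of an exact sign test are illustrative assumptions for this sketch, not the authors' actual data or statistical method:

```python
# Hypothetical sketch (not the study's real data or code): grade each
# diagnosis against the discharge diagnosis on a point scale, then compare
# two paired raters with an exact two-sided sign test.
from math import comb

# Toy per-patient scores: 0 = wrong, 1 = partially correct, 2 = correct.
gpt4_scores      = [2, 2, 1, 2, 0, 2, 1, 2, 2, 1]
physician_scores = [1, 2, 0, 2, 0, 1, 1, 2, 1, 1]

def sign_test(a, b):
    """Exact two-sided sign test on paired scores (ties are dropped)."""
    wins   = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Two-sided tail probability of a Binomial(n, 0.5) outcome this extreme.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

mean_gpt4 = sum(gpt4_scores) / len(gpt4_scores)
mean_phys = sum(physician_scores) / len(physician_scores)
p_value = sign_test(gpt4_scores, physician_scores)
print(mean_gpt4, mean_phys, p_value)  # → 1.5 1.1 0.125
```

With only 10 toy pairs the sign test is underpowered; the study's reported P values would come from its own grading scheme and test applied to all 100 patients.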
Pages: 7
Related Articles
50 records
  • [1] Comparison of Diagnostic and Triage Accuracy of Ada Health and WebMD Symptom Checkers, ChatGPT, and Physicians for Patients in an Emergency Department: Clinical Data Analysis Study
    Fraser, Hamish
    Crossland, Daven
    Bacher, Ian
    Ranney, Megan
    Madsen, Tracy
    Hilliard, Ross
    JMIR MHEALTH AND UHEALTH, 2023, 11
  • [2] Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4
    Lahat, Adi
    Sharif, Kassem
    Zoabi, Narmin
    Patt, Yonatan Shneor
    Sharif, Yousra
    Fisher, Lior
    Shani, Uria
    Arow, Mohamad
    Levin, Roni
    Klang, Eyal
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [3] The accuracy of Gemini, GPT-4, and GPT-4o in ECG analysis: A comparison with cardiologists and emergency medicine specialists
    Gunay, Serkan
    Ozturk, Ahmet
    Yigit, Yavuz
    AMERICAN JOURNAL OF EMERGENCY MEDICINE, 2024, 84 : 68 - 73
  • [4] Usefulness of the large language model ChatGPT (GPT-4) as a diagnostic tool and information source in dermatology
    Nielsen, Jacob P. S.
    Gronhoj, Christian
    Skov, Lone
    Gyldenlove, Mette
    JEADV CLINICAL PRACTICE, 2024, 3 (05) : 1570 - 1575
  • [5] ChatGPT as a Source of Information for Bariatric Surgery Patients: a Comparative Analysis of Accuracy and Comprehensiveness Between GPT-4 and GPT-3.5
    Samaan, Jamil S.
    Rajeev, Nithya
    Ng, Wee Han
    Srinivasan, Nitin
    Busam, Jonathan A.
    Yeo, Yee Hui
    Samakar, Kamran
    OBESITY SURGERY, 2024, 34 (05) : 1987 - 1989
  • [6] Accuracy of ChatGPT-3.5 and GPT-4 in diagnosing clinical scenarios in dermatology involving skin of color
    Qureshi, Simal
    Alli, Sauliha Rabia
    Ogunyemi, Boluwaji
    INTERNATIONAL JOURNAL OF DERMATOLOGY, 2024, 63 (11) : e353 - e354
  • [7] ChatGPT and GPT-4: utilities in the legal sector, functioning, limitations and risks of foundational models
    Gomez, Francisco Julio Dosal
    Galende, Judith Nieto
    TECNOLOGIA CIENCIA Y EDUCACION, 2024, (28) : 45 - 88
  • [8] Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations
    Ali, Rohaid
    Tang, Oliver Y.
    Connolly, Ian D.
    Sullivan, Patricia L. Zadnik
    Shin, John H.
    Fridley, Jared S.
    Asaad, Wael F.
    Cielo, Deus
    Oyelese, Adetokunbo A.
    Doberstein, Curtis E.
    Gokaslan, Ziya L.
    Telfeian, Albert E.
    NEUROSURGERY, 2023, 93 (06) : 1353 - 1365
  • [9] Using Natural Language Processing (GPT-4) for Computed Tomography Image Analysis of Cerebral Hemorrhages in Radiology: Retrospective Analysis
    Zhang, Daiwen
    Ma, Zixuan
    Gong, Ru
    Lian, Liangliang
    Li, Yanzhuo
    He, Zhenghui
    Han, Yuhan
    Hui, Jiyuan
    Huang, Jialin
    Jiang, Jiyao
    Weng, Weiji
    Feng, Junfeng
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26