Evaluation of large language models as a diagnostic aid for complex medical cases

Cited by: 4
Authors
Rios-Hoyo, Alejandro [1 ]
Shan, Naing Lin [1 ]
Li, Anran [2 ]
Pearson, Alexander T. [2 ]
Pusztai, Lajos [1 ]
Howard, Frederick M. [2 ]
Affiliations
[1] Yale Sch Med, Yale Canc Ctr, New Haven, CT 06510 USA
[2] Univ Chicago, Dept Med, Chicago, IL 60637 USA
Keywords
large language model (LLM); ChatGPT; complex clinical cases; diagnosis; clinical case solving
DOI
10.3389/fmed.2024.1380148
Chinese Library Classification (CLC)
R5 [Internal Medicine]
Subject Classification Code
1002; 100201
Abstract
Background: The use of large language models (LLMs) has recently gained popularity in diverse areas, including answering questions posted by patients as well as by medical professionals.
Objective: To evaluate the performance and limitations of LLMs in providing the correct diagnosis for complex clinical cases.
Design: Seventy-five consecutive clinical cases were selected from the Massachusetts General Hospital Case Records, and differential diagnoses were generated by OpenAI's GPT3.5 and GPT4 models.
Results: The mean number of diagnoses provided was 16.77 by the Massachusetts General Hospital case discussants, 30 by GPT3.5, and 15.45 by GPT4 (p < 0.0001). GPT4 was more frequently able to list the correct diagnosis first (22% versus 20% with GPT3.5, p = 0.86) and to provide the correct diagnosis among the top three generated diagnoses (42% versus 24%, p = 0.075). GPT4 was better at providing the correct diagnosis when the generated diagnoses were classified into groups according to medical specialty, and more frequently included the correct diagnosis at any point in the differential list (68% versus 48%, p = 0.0063). GPT4 provided a differential list that was more similar to the list provided by the case discussants than GPT3.5 did (Jaccard Similarity Index 0.22 versus 0.12, p = 0.001). Inclusion of the correct diagnosis in the generated differential was correlated with the number of PubMed articles matching the diagnosis (OR 1.40, 95% CI 1.25-1.56 for GPT3.5; OR 1.25, 95% CI 1.13-1.40 for GPT4), but not with disease incidence.
Conclusions and relevance: The GPT4 model was able to generate a differential diagnosis list containing the correct diagnosis in approximately two thirds of cases, but the most likely diagnosis was often incorrect for both models. In its current state, this tool can at most be used as an aid to expand on potential diagnostic considerations for a case, and future LLMs should be trained in a way that accounts for the discrepancy between disease incidence and availability in the literature.
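A minimal sketch of how the Jaccard Similarity Index reported in the Results could be computed between a discussant's differential list and a model-generated one. The diagnosis terms and the simple lowercase/whitespace normalization below are illustrative assumptions; the paper's exact term-matching procedure is not specified in this record.

```python
def jaccard_similarity(list_a: list[str], list_b: list[str]) -> float:
    """Jaccard Similarity Index between two differential-diagnosis lists,
    treated as sets of normalized diagnosis terms: |A ∩ B| / |A ∪ B|."""
    set_a = {d.strip().lower() for d in list_a}
    set_b = {d.strip().lower() for d in list_b}
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical example lists (not from the study data):
discussant = ["Sarcoidosis", "Tuberculosis", "Lymphoma", "Histoplasmosis"]
model_output = ["Lymphoma", "Sarcoidosis", "Lung adenocarcinoma"]
print(f"Jaccard similarity: {jaccard_similarity(discussant, model_output):.2f}")  # 0.40
```

In practice, diagnosis strings from an LLM rarely match the discussant's wording exactly, so any real comparison would need a mapping step (synonym normalization or manual adjudication) before the set operation.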
Pages: 6