Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam

Cited by: 3
Authors
Builoff, Valerie [1 ]
Shanbhag, Aakash [1 ,2 ]
Miller, Robert J. H. [1 ,3 ]
Dey, Damini [1 ]
Liang, Joanna X. [1 ]
Flood, Kathleen [4 ]
Bourque, Jamieson M. [5 ]
Chareonthaitawee, Panithaya [6 ]
Phillips, Lawrence M. [7 ]
Slomka, Piotr J. [1 ]
Affiliations
[1] Cedars Sinai Med Ctr, Dept Med, Div Artificial Intelligence Med, Imaging & Biomed Sci, Los Angeles, CA 90048 USA
[2] Univ Southern Calif, Signal & Image Proc Inst, Ming Hsieh Dept Elect & Comp Engn, Los Angeles, CA USA
[3] Univ Calgary, Dept Cardiac Sci, Calgary, AB, Canada
[4] Amer Soc Nucl Cardiol, Fairfax, VA USA
[5] Univ Virginia Hlth Syst, Div Cardiovasc Med & Radiol, Charlottesville, VA USA
[6] Mayo Clin, Dept Cardiovasc Med, Rochester, MN USA
[7] NYU Grossman Sch Med, Dept Med, Leon H Charney Div Cardiol, New York, NY USA
Funding
US National Institutes of Health;
Keywords
Nuclear cardiology board exam; Large language models; GPT; Cardiovascular imaging questions; PERFORMANCE;
DOI
10.1016/j.nuclcard.2024.102089
Chinese Library Classification
R5 [Internal Medicine];
Discipline Classification Code
1002; 100201;
Abstract
Background: Previous studies have evaluated the abilities of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs, GPT-4, GPT-4 Turbo, GPT-4 omni (GPT-4o) (OpenAI), and Gemini (Google Inc.), in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.

Methods: We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared the proportions of correct responses.

Results: GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4%-58.0%), 40.5% (39.9%-42.9%), 60.7% (59.5%-61.3%), and 63.1% (62.5%-64.3%) of questions, respectively. GPT-4o significantly outperformed the other models (P = .007 vs GPT-4 Turbo; P < .001 vs GPT-4 and Gemini). GPT-4o also excelled on text-only questions compared with GPT-4, Gemini, and GPT-4 Turbo (P < .001, P < .001, and P = .001, respectively), while Gemini performed worse on image-based questions (P < .001 for all comparisons).

Conclusion: GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.
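The pairwise comparisons in the Results rest on McNemar's test, which evaluates paired binary outcomes (each question answered by both models) using only the discordant pairs. As an illustration of the statistic, not the authors' actual code, a minimal exact two-sided McNemar test can be written directly from the two discordant counts:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: questions model A answered correctly that model B missed
    c: questions model B answered correctly that model A missed
    Concordant pairs (both right or both wrong) do not enter the test.
    """
    n = b + c
    # Under the null hypothesis of equal performance, the smaller
    # discordant count follows Binomial(n, 0.5); double the lower
    # tail for a two-sided test and cap at 1.
    p = 2 * sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical discordant counts for two models on a 168-question exam:
print(round(mcnemar_exact(5, 15), 4))  # → 0.0414
```

With 5 vs 15 discordant questions the exact p-value falls just below .05, illustrating how a modest asymmetry in discordant answers can reach significance on an exam of this size; equal discordant counts (e.g., 10 vs 10) give p = 1.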
Pages: 11