Experimental assessment of the performance of artificial intelligence in solving multiple-choice board exams in cardiology

Cited by: 1
Authors
Huwiler, Jessica [1 ,2 ]
Oechslin, Luca [1 ]
Biaggi, Patric [1 ,2 ]
Tanner, Felix C. [2 ,3 ]
Wyss, Christophe Alain [1 ,2 ,3 ]
Affiliations
[1] Heart Clin Zurich, Zurich, Switzerland
[2] Univ Zurich, Zurich, Switzerland
[3] Swiss Soc Cardiol, Basel, Switzerland
Keywords
EUROPEAN EXAM; CHATGPT; HEALTH;
DOI
10.57187/s.3547
CLC number
R5 [Internal Medicine];
Subject classification codes
1002; 100201;
Abstract
AIMS: The aim of the present study was to evaluate the performance of various artificial intelligence (AI)-powered chatbots (commercially available in Switzerland up to June 2023) in solving a theoretical cardiology board exam and to compare their accuracy with that of human cardiology fellows. METHODS: For the study, a set of 88 multiple-choice cardiology exam questions was used. The participating cardiology fellows and selected chatbots were presented with these questions. The evaluation metrics included Top-1 and Top-2 accuracy, assessing the ability of chatbots and fellows to select the correct answer. RESULTS: Among the cardiology fellows, all 36 participants successfully passed the exam with a median accuracy of 98% (IQR 91-99%, range from 78% to 100%). However, the performance of the chatbots varied. Only one chatbot, Jasper quality, achieved the minimum pass rate of 73% correct answers. Most chatbots demonstrated a median Top-1 accuracy of 47% (IQR 44-53%, range from 42% to 73%), while Top-2 accuracy provided a modest improvement, resulting in a median accuracy of 67% (IQR 65-72%, range from 61% to 82%). Even with this advantage, only two chatbots, Jasper quality and ChatGPT plus 4.0, would have passed the exam. Similar results were observed when picture-based questions were excluded from the dataset. CONCLUSIONS: Overall, the study suggests that most current language-based chatbots have limitations in accurately solving theoretical medical board exams. In general, currently widely available chatbots fell short of achieving a passing score in a theoretical cardiology board exam. Nevertheless, a few showed promising results. Further improvements in artificial intelligence language models may lead to better performance in medical knowledge applications in the future.
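The abstract scores chatbots by Top-1 accuracy (the single preferred answer is correct) and Top-2 accuracy (the correct answer is among the two highest-ranked options). A minimal sketch of how such metrics are computed, assuming hypothetical per-question answer rankings (the function name, toy data, and ranking format are illustrative, not from the study):

```python
def top_k_accuracy(ranked_answers, correct, k):
    """Fraction of questions whose correct option appears among
    the respondent's k highest-ranked choices."""
    hits = sum(1 for ranks, truth in zip(ranked_answers, correct)
               if truth in ranks[:k])
    return hits / len(correct)

# Toy data: each inner list is one chatbot's options ranked by preference
# for a four-question multiple-choice exam.
ranked = [["B", "A"], ["C", "D"], ["A", "B"], ["D", "B"]]
truth  = ["B", "D", "C", "D"]

top1 = top_k_accuracy(ranked, truth, 1)  # 0.5  (questions 1 and 4)
top2 = top_k_accuracy(ranked, truth, 2)  # 0.75 (question 3 still missed)
```

By construction Top-2 accuracy can only match or exceed Top-1, which is why the abstract reports it as a "modest improvement" that still left all but two chatbots below the pass threshold.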
Pages: 8