Experimental assessment of the performance of artificial intelligence in solving multiple-choice board exams in cardiology

Cited by: 1
Authors
Huwiler, Jessica [1 ,2 ]
Oechslin, Luca [1 ]
Biaggi, Patric [1 ,2 ]
Tanner, Felix C. [2 ,3 ]
Wyss, Christophe Alain [1 ,2 ,3 ]
Affiliations
[1] Heart Clin Zurich, Zurich, Switzerland
[2] Univ Zurich, Zurich, Switzerland
[3] Swiss Soc Cardiol, Basel, Switzerland
Keywords
EUROPEAN EXAM; CHATGPT; HEALTH;
DOI
10.57187/s.3547
CLC number
R5 [Internal Medicine];
Subject classification codes
1002; 100201;
Abstract
AIMS: The aim of the present study was to evaluate the performance of various artificial intelligence (AI)-powered chatbots (commercially available in Switzerland up to June 2023) in solving a theoretical cardiology board exam and to compare their accuracy with that of human cardiology fellows. METHODS: For the study, a set of 88 multiple-choice cardiology exam questions was used. The participating cardiology fellows and selected chatbots were presented with these questions. The evaluation metrics included Top-1 and Top-2 accuracy, assessing the ability of chatbots and fellows to select the correct answer. RESULTS: Among the cardiology fellows, all 36 participants successfully passed the exam with a median accuracy of 98% (IQR 91-99%, range from 78% to 100%). However, the performance of the chatbots varied. Only one chatbot, Jasper quality, achieved the minimum pass rate of 73% correct answers. Most chatbots demonstrated a median Top-1 accuracy of 47% (IQR 44-53%, range from 42% to 73%), while Top-2 accuracy provided a modest improvement, resulting in a median accuracy of 67% (IQR 65-72%, range from 61% to 82%). Even with this advantage, only two chatbots, Jasper quality and ChatGPT plus 4.0, would have passed the exam. Similar results were observed when picture-based questions were excluded from the dataset. CONCLUSIONS: Overall, the study suggests that most current language-based chatbots have limitations in accurately solving theoretical medical board exams. In general, currently widely available chatbots fell short of achieving a passing score in a theoretical cardiology board exam. Nevertheless, a few showed promising results. Further improvements in artificial intelligence language models may lead to better performance in medical knowledge applications in the future.
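The abstract scores chatbots by Top-1 accuracy (the single preferred answer is correct) and Top-2 accuracy (the correct answer is among the two highest-ranked options). A minimal sketch of how such metrics are computed, assuming hypothetical per-question answer rankings (the function name, toy data, and ranking format are illustrative, not from the study):

```python
def top_k_accuracy(ranked_answers, correct, k):
    """Fraction of questions whose correct option appears among
    the respondent's k highest-ranked choices."""
    hits = sum(1 for ranks, truth in zip(ranked_answers, correct)
               if truth in ranks[:k])
    return hits / len(correct)

# Toy data: each inner list is one chatbot's options ranked by preference
# for a four-question multiple-choice exam.
ranked = [["B", "A"], ["C", "D"], ["A", "B"], ["D", "B"]]
truth  = ["B", "D", "C", "D"]

top1 = top_k_accuracy(ranked, truth, 1)  # 0.5  (questions 1 and 4)
top2 = top_k_accuracy(ranked, truth, 2)  # 0.75 (question 3 still missed)
```

By construction Top-2 accuracy can only match or exceed Top-1, which is why the abstract reports it as a "modest improvement" that still left all but two chatbots below the pass threshold.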
Pages: 8