ChatGPT in medical school: how successful is AI in progress testing?

Cited by: 66
Authors
Friederichs, Hendrik [1 ,7 ]
Friederichs, Wolf Jonas [2 ]
Maerz, Maren [3 ,4 ,5 ,6 ]
Affiliations
[1] Bielefeld Univ, Med Sch OWL, Bielefeld, Germany
[2] Rhein Westfal TH Aachen, Fac Mech Engn, Aachen, Germany
[3] Charite Univ med Berlin, Berlin, Germany
[4] Freien Univ Berlin, Berlin, Germany
[5] Humboldt Univ, Berlin, Germany
[6] Progress Test Med, Charite pl 1, Berlin, Germany
[7] Univ Bielefeld, AG 7 Med Educ, Med Fak OWL, Univ Str 25, D-33615 Bielefeld, Germany
Keywords
Medical education; progress test; learning; artificial intelligence; machine learning; BASIC SCIENCE EXAMINATION; USMLE STEP 1; RISK LITERACY; NUMERACY; PERFORMANCE; STUDENTS; COMPREHENSION; ACQUISITION; CURRICULUM; STRENGTHS;
DOI
10.1080/10872981.2023.2220920
Chinese Library Classification
G40 [Education];
Discipline Classification Codes
040101; 120403;
Abstract
Background: As a generative artificial intelligence (AI), ChatGPT provides easy access to a wide range of information, including factual knowledge in the field of medicine. Because knowledge acquisition is a basic determinant of physicians' performance, teaching and testing different levels of medical knowledge is a central task of medical schools. To measure the level of factual knowledge in ChatGPT's responses, we compared its performance with that of medical students on a progress test.
Methods: A total of 400 multiple-choice questions (MCQs) from the progress test used in German-speaking countries were entered into ChatGPT's user interface, and the percentage of correctly answered questions was recorded. We correlated the correctness of each ChatGPT response with its response time, its word count, and the difficulty index of the progress test question.
Results: Of the 395 responses evaluated, 65.5% of the progress test questions answered by ChatGPT were correct. On average, ChatGPT required 22.8 s (SD 17.5) for a complete response, which contained 36.2 (SD 28.1) words. Neither response time nor word count correlated with the accuracy of the ChatGPT response (time: rho = -0.08, 95% CI [-0.18, 0.02], t(393) = -1.55, p = 0.121; word count: rho = -0.03, 95% CI [-0.13, 0.07], t(393) = -0.54, p = 0.592). There was a significant correlation between the difficulty index of the MCQs and the accuracy of the ChatGPT response (difficulty: rho = 0.16, 95% CI [0.06, 0.25], t(393) = 3.19, p = 0.002).
Conclusion: ChatGPT correctly answered two-thirds of all MCQs in the Progress Test Medicine, which are written at the level of the German state licensing examination, and outperformed almost all medical students in years 1-3. Its performance was comparable to that of medical students in the second half of their studies.
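The correlations reported in the Results (Spearman's rho with a 95% CI and a t-test on n - 2 degrees of freedom) follow standard rank-correlation procedures. The Python sketch below is not the authors' analysis code; it uses simulated data and illustrative variable names to show how such a correlation between a per-item difficulty index and binary response correctness might be computed, with the t-statistic and an approximate Fisher-z confidence interval.

```python
# Sketch only (not the authors' analysis code): Spearman correlation between
# per-item difficulty and ChatGPT correctness, with the t-statistic
# (df = n - 2) and an approximate Fisher-z 95% confidence interval.
import numpy as np
from scipy.stats import spearmanr, t as t_dist

def spearman_report(x, y):
    """Spearman's rho, approximate 95% CI, t-statistic, and two-sided p-value."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    rho, _ = spearmanr(x, y)
    # t-statistic for H0: rho = 0 (as in "t(393) = 3.19" with n = 395 items)
    t_stat = rho * np.sqrt((n - 2) / (1.0 - rho**2))
    p = 2 * t_dist.sf(abs(t_stat), df=n - 2)
    # Approximate 95% CI via the Fisher z-transformation
    z, se = np.arctanh(rho), 1.0 / np.sqrt(n - 3)
    ci = (np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se))
    return rho, ci, t_stat, p

# Simulated example: 395 items, difficulty index in [0, 1] (proportion of
# students answering correctly) and binary ChatGPT correctness (1 = correct).
rng = np.random.default_rng(0)
difficulty = rng.uniform(0.2, 0.95, size=395)
correct = rng.binomial(1, 0.4 + 0.3 * difficulty)  # weak positive association
print(spearman_report(difficulty, correct))
```

A rank-based coefficient is a natural choice here because correctness is binary and the difficulty index is bounded, so no assumption of bivariate normality is needed; the exact method used by the authors may differ.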
Pages: 9