Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study

Cited: 18
Authors
Huang, Ryan S. T. [1]
Lu, Kevin Jia Qi [2]
Meaney, Christopher [2]
Kemppainen, Joel [2]
Punnett, Angela [1,3]
Leung, Fok-Han [2]
Affiliations
[1] Univ Toronto, Temerty Fac Med, 1 Kings Coll Cir, Toronto, ON M5S 1A8, Canada
[2] Univ Toronto, Dept Family & Community Med, Toronto, ON, Canada
[3] Hosp Sick Children, Div Haematol, Toronto, ON, Canada
Source
JMIR MEDICAL EDUCATION | 2023, Vol. 9
Keywords
medical education; medical knowledge exam; artificial intelligence; AI; natural language processing; NLP; large language model; LLM; machine learning; ChatGPT; GPT-3.5; GPT-4; education; language model; education examination; testing; utility; family medicine; medical residents; test; community;
DOI
10.2196/50514
Chinese Library Classification (CLC)
G40 [Education];
Subject Classification Code
040101 ; 120403 ;
Abstract
Background: Large language model (LLM)-based chatbots are evolving at an unprecedented pace with the release of ChatGPT, specifically GPT-3.5, and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have advanced to the point of strong performance on a range of educational examination benchmarks, including medical knowledge tests. Comparing the performance of these 2 LLMs with that of Family Medicine residents on a multiple-choice medical knowledge test can provide insights into their potential as medical education tools.

Objective: This study aimed to quantitatively and qualitatively compare the performance of GPT-3.5, GPT-4, and Family Medicine residents on a multiple-choice medical knowledge test appropriate for the level of a Family Medicine resident.

Methods: An official University of Toronto Department of Family and Community Medicine Progress Test consisting of multiple-choice questions was entered into GPT-3.5 and GPT-4. The artificial intelligence chatbots' responses were manually reviewed to determine the selected answer, response length, response time, whether a rationale was provided for the outputted response, and the root cause of all incorrect responses (classified into arithmetic, logical, and information errors). The performance of the artificial intelligence chatbots was compared with that of a cohort of Family Medicine residents who concurrently attempted the test.

Results: GPT-4 performed significantly better than GPT-3.5 (difference 25.0%, 95% CI 16.3%-32.8%; McNemar test: P<.001), answering 89 of 108 (82.4%) questions correctly versus 62 of 108 (57.4%) for GPT-3.5. Further, GPT-4 scored higher across all 11 categories of Family Medicine knowledge. GPT-4 provided a rationale for why other multiple-choice options were not chosen in 86.1% (n=93) of its responses, compared with 16.7% (n=18) for GPT-3.5. Qualitatively, for both GPT-3.5 and GPT-4, logical errors were the most common and arithmetic errors the least common. The average performance of Family Medicine residents was 56.9% (95% CI 56.2%-57.6%). The performance of GPT-3.5 was similar to that of the average Family Medicine resident (P=.16), while the performance of GPT-4 exceeded that of the top-performing Family Medicine resident (P<.001).

Conclusions: GPT-4 significantly outperforms both GPT-3.5 and Family Medicine residents on a multiple-choice medical knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its response choice, ruling out other answer choices efficiently and with concise justification. Its high degree of accuracy and advanced reasoning capabilities facilitate potential applications in medical education, including the creation of exam questions and scenarios as well as serving as a resource for medical knowledge or information on community services.
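To illustrate the paired comparison reported in the Results, the sketch below runs an exact McNemar test on per-question correctness for two models answering the same 108-item test. This is a minimal Python sketch, not the authors' analysis code; the gpt4_correct and gpt35_correct arrays are hypothetical placeholders standing in for the manually reviewed answer data.

```python
# Minimal sketch (not the authors' code): paired comparison of two models'
# per-question correctness on the same 108-item test using McNemar's test.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_items = 108
# Placeholder data: hypothetical per-question correctness flags,
# drawn to roughly match the reported accuracies (~82% vs ~57%).
gpt4_correct = rng.random(n_items) < 0.82
gpt35_correct = rng.random(n_items) < 0.57

# 2x2 contingency table of paired outcomes:
# rows = GPT-4 (correct, incorrect), cols = GPT-3.5 (correct, incorrect)
table = np.array([
    [np.sum(gpt4_correct & gpt35_correct),  np.sum(gpt4_correct & ~gpt35_correct)],
    [np.sum(~gpt4_correct & gpt35_correct), np.sum(~gpt4_correct & ~gpt35_correct)],
])

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"Accuracy difference: {gpt4_correct.mean() - gpt35_correct.mean():.1%}")
print(f"McNemar P value: {result.pvalue:.4f}")
```

The McNemar test is appropriate here because both chatbots answered the identical set of questions, so the comparison rests on the discordant pairs (questions one model answered correctly and the other did not) rather than on two independent accuracy estimates.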
Pages: 9