Evaluating ChatGPT-3.5 and Claude-2 in Answering and Explaining Conceptual Medical Physiology Multiple-Choice Questions

Cited by: 13
Authors
Agarwal, Mayank [1 ]
Goswami, Ayan [2 ]
Sharma, Priyanka [3 ]
Affiliations
[1] All India Inst Med Sci, Physiol, Raebareli, India
[2] Santiniketan Med Coll, Physiol, Bolpur, India
[3] Sharda Univ, Sch Med Sci & Res, Physiol, Greater Noida, India
Keywords
physiology; medical education; multiple choice questions; large language models; claude; chatgpt; artificial intelligence;
DOI
10.7759/cureus.46222
CLC Classification
R5 [Internal Medicine]
Subject Classification
1002; 100201
Abstract
Background: Generative artificial intelligence (AI) systems such as ChatGPT-3.5 and Claude-2 may assist in explaining complex medical science topics. A few studies have shown that AI can solve complicated physiology problems that require critical thinking and analysis. However, further studies are needed to validate the effectiveness of AI in answering conceptual multiple-choice questions (MCQs) in human physiology.
Objective: This study aimed to evaluate and compare the proficiency of ChatGPT-3.5 and Claude-2 in answering and explaining a curated set of MCQs in medical physiology.
Methods: In this cross-sectional study, a set of 55 MCQs drawn from 10 competencies of medical physiology was purposefully constructed to require comprehension, problem-solving, and analytical skills. The MCQs, together with a structured prompt for response generation, were presented to ChatGPT-3.5 and Claude-2. The explanations provided by both AI systems were documented in an Excel spreadsheet, and all three authors rated each explanation on a scale of 0 to 3: 0 for an incorrect explanation, 1 for a partially correct explanation, 2 for a correct explanation with some aspects missing, and 3 for a perfectly correct explanation. Both AI models were evaluated on their ability to choose the correct answer (option) and to provide clear, comprehensive explanations of the MCQs. The Mann-Whitney U test was used to compare AI responses, and the Fleiss multi-rater kappa (κ) was used to determine the score agreement among the three raters. Statistical significance was set at P <= 0.05.
Results: Claude-2 answered 40 of the 55 MCQs correctly, significantly more than the 26 correct responses from ChatGPT-3.5. The rating distribution for the explanations generated by Claude-2 was also significantly higher than that for ChatGPT-3.5. The κ values were 0.804 and 0.818 for Claude-2 and ChatGPT-3.5, respectively.
Conclusion: In answering and explaining conceptual MCQs in medical physiology, Claude-2 outperformed ChatGPT-3.5. However, accessing Claude-2 from India requires a virtual private network, which may raise security concerns.
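The statistical workflow described in the Methods (three raters scoring each explanation 0-3, Fleiss' multi-rater kappa for inter-rater agreement, and the Mann-Whitney U test for comparing the two models' rating distributions) can be sketched as follows. This is a minimal illustration only: the ratings below are hypothetical placeholders, not the study's data, and the manual `fleiss_kappa` function is a standard textbook implementation, not the authors' code.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def fleiss_kappa(ratings, n_cat=4):
    """Fleiss' kappa for an (items x raters) array of integer categories 0..n_cat-1."""
    ratings = np.asarray(ratings)
    n_items, n_raters = ratings.shape
    # counts[i, j] = number of raters who assigned category j to item i
    counts = np.zeros((n_items, n_cat))
    for i, row in enumerate(ratings):
        for r in row:
            counts[i, r] += 1
    p_j = counts.sum(axis=0) / (n_items * n_raters)    # overall category proportions
    # Per-item agreement, then chance-corrected overall agreement
    P_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical 0-3 explanation ratings from three raters for five MCQs
claude_ratings = np.array([[3, 3, 3], [2, 2, 3], [3, 3, 3], [1, 1, 1], [0, 0, 1]])
chatgpt_ratings = np.array([[2, 2, 2], [1, 1, 2], [3, 3, 3], [0, 0, 0], [0, 1, 0]])

kappa_claude = fleiss_kappa(claude_ratings)

# Mann-Whitney U on per-item consensus (median) ratings of the two models
u_stat, p_value = mannwhitneyu(
    np.median(claude_ratings, axis=1),
    np.median(chatgpt_ratings, axis=1),
    alternative="two-sided",
)
print(f"kappa={kappa_claude:.3f}, U={u_stat}, p={p_value:.3f}")
```

With only five illustrative items the U test has little power; the study applied these methods to all 55 MCQs.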
Pages: 18