LLM-based automatic short answer grading in undergraduate medical education

Cited by: 8
Authors
Grevisse, Christian [1 ]
Affiliations
[1] Univ Luxembourg, Dept Life Sci & Med, 6 Ave Fonte, L-4364 Esch-sur-Alzette, Luxembourg
Keywords
Automatic short answer grading; Medical education; Large language models; GPT-4; Gemini
DOI
10.1186/s12909-024-06026-5
Chinese Library Classification
G40 [Education]
Discipline codes
040101; 120403
Abstract
Background: Multiple-choice questions are widely used in medical education assessments, but they test recognition rather than knowledge recall. Grading open questions, however, is a time-intensive task for teachers. Automatic short answer grading (ASAG) aims to fill this gap, and with the recent advent of Large Language Models (LLMs), the field has gained new momentum.
Methods: We graded 2288 student answers from 12 undergraduate medical education courses in 3 languages using GPT-4 and Gemini 1.0 Pro.
Results: GPT-4 proposed significantly lower grades than the human evaluator but reached low rates of false positives. The grades of Gemini 1.0 Pro did not differ significantly from the teachers'. Both LLMs reached a moderate agreement with human grades, and GPT-4 achieved high precision among answers considered fully correct. A consistent grading behavior could be established for high-quality answer keys. Only a weak correlation was found with respect to the length or language of student answers. There is a risk of bias if the LLM knows the human grade a priori.
Conclusions: LLM-based ASAG applied to medical education still requires human oversight, but time can be saved on the edge cases, allowing teachers to focus on the middle ones. For Bachelor-level medical education questions, the training knowledge of LLMs appears sufficient; fine-tuning is thus not necessary.
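The grading workflow described in the Methods can be sketched as a prompt-then-parse loop. The snippet below is a minimal illustration only, not the paper's actual protocol: the prompt wording, the `Grade:` reply format, and the stubbed model reply are all assumptions introduced here for demonstration; in practice the reply would come from a chat-completion API call to GPT-4 or Gemini 1.0 Pro.

```python
import re

def build_grading_prompt(question, key, student_answer, max_points):
    """Assemble a grading prompt for a chat-style LLM (illustrative wording)."""
    return (
        f"You are grading a short-answer exam question worth {max_points} points.\n"
        f"Question: {question}\n"
        f"Answer key: {key}\n"
        f"Student answer: {student_answer}\n"
        "Reply with a single line of the form 'Grade: <number>'."
    )

def parse_grade(model_reply, max_points):
    """Extract the numeric grade from the reply and clamp it to [0, max_points]."""
    match = re.search(r"Grade:\s*([0-9]+(?:\.[0-9]+)?)", model_reply)
    if match is None:
        raise ValueError("no grade found in model reply")
    return min(max(float(match.group(1)), 0.0), max_points)

# Stubbed model reply in place of a real API call:
reply = "Grade: 1.5"
print(parse_grade(reply, 2.0))  # 1.5
```

Clamping the parsed value guards against a model replying with a grade outside the allowed point range, one of the oversight concerns the Conclusions raise.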
Pages: 16