Fine-tuning ChatGPT for automatic scoring

Cited by: 0
Authors
Latif E. [1 ]
Zhai X. [1 ]
Affiliations
[1] AI4STEM Education Center, University of Georgia, Athens, GA
Source
Computers and Education: Artificial Intelligence | 2024 / Vol. 6
Funding
U.S. National Science Foundation;
Keywords
Automatic scoring; BERT; Education; Finetune; GPT-3.5; Large language model (LLM);
DOI
10.1016/j.caeai.2024.100210
Abstract
This study highlights the potential of fine-tuned ChatGPT (GPT-3.5) for automatically scoring student-written constructed responses, using example assessment tasks in science education. The application of ChatGPT in research and academic fields has greatly enhanced productivity and efficiency. Recent studies on ChatGPT, based on OpenAI's generative model GPT-3.5, demonstrated its ability to predict natural language with high accuracy and produce human-like responses. GPT-3.5 has been trained on enormous online language materials such as journals and Wikipedia; however, directly using pre-trained GPT-3.5 is insufficient for automatic scoring, because students do not write in the same register as journals or Wikipedia, and contextual information is required for accurate scoring. This implies that fine-tuning a domain-specific model on task-specific data can enhance performance. In this study, we fine-tuned GPT-3.5 on six assessment tasks with a diverse dataset of middle-school and high-school student responses and expert scores. The six tasks comprise two multi-label and four multi-class assessment tasks. We compare the performance of fine-tuned GPT-3.5 with that of the fine-tuned state-of-the-art language model from Google, BERT. The results show that BERT, trained on in-domain corpora constructed from science questions and responses, achieved an average accuracy of 0.838 (SD = 0.069). GPT-3.5 shows a remarkable average increase of 9.1% in automatic scoring accuracy (mean = 0.915, SD = 0.042) across the six tasks, p = 0.001 < 0.05. Specifically, for each of the two multi-label tasks (item 1 with 5 labels; item 2 with 10 labels), GPT-3.5 achieved significantly higher scoring accuracy than BERT across all labels, with the second item showing a 7.1% increase. The average scoring increase for the four multi-class items was 10.6% for GPT-3.5 compared with BERT. Our study confirms the effectiveness of fine-tuned GPT-3.5 for highly accurate automatic scoring of student responses on domain-specific educational data. We have released the fine-tuned models for public use and community engagement. © 2024 The Author(s)
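
The abstract summarizes the fine-tuning workflow but does not spell out its implementation; as a minimal sketch, assuming the standard OpenAI chat fine-tuning API (Python SDK v1), preparing expert-scored responses and training a scoring model might look like the following. The item text, rubric label, and file name are hypothetical placeholders, not the authors' released prompts, data, or models.

# Illustrative sketch only: fine-tuning GPT-3.5 for rubric-based scoring via the
# OpenAI fine-tuning API. Prompts, labels, and file names are hypothetical.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Build a chat-format JSONL file: one expert-scored student response per line.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a science-assessment scorer. "
                                          "Return only the rubric label."},
            {"role": "user", "content": "Item: Explain why ice melts faster on a metal plate.\n"
                                        "Response: Metal moves heat to the ice quicker than wood."},
            {"role": "assistant", "content": "Proficient"},  # expert-assigned score
        ]
    },
    # ... one entry per scored response in the training split
]
with open("scoring_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# 2) Upload the file and start a fine-tuning job on GPT-3.5.
train_file = client.files.create(file=open("scoring_train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=train_file.id, model="gpt-3.5-turbo")

# 3) After the job finishes, score a new response with the fine-tuned model.
job = client.fine_tuning.jobs.retrieve(job.id)  # in practice, poll until status == "succeeded"
if job.status == "succeeded":
    reply = client.chat.completions.create(
        model=job.fine_tuned_model,
        messages=[
            {"role": "system", "content": "You are a science-assessment scorer. "
                                          "Return only the rubric label."},
            {"role": "user", "content": "Item: Explain why ice melts faster on a metal plate.\n"
                                        "Response: Because metal is colder."},
        ],
    )
    print(reply.choices[0].message.content)  # predicted rubric label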
Related papers
Total: 50 records
  • [1] Fine-Tuning ChatGPT for Automatic Scoring of Written Scientific Explanations in Chinese
    Yang, Jie
    Latif, Ehsan
    He, Yuze
    Zhai, Xiaoming
    JOURNAL OF SCIENCE EDUCATION AND TECHNOLOGY, 2025,
  • [2] Two-stage fine-tuning with ChatGPT data augmentation for learning class-imbalanced data
    Valizadehaslani, Taha
    Shi, Yiwen
    Wang, Jing
    Ren, Ping
    Zhang, Yi
    Hu, Meng
    Zhao, Liang
    Liang, Hualou
    NEUROCOMPUTING, 2024, 592
  • [3] Emerging trends: A gentle introduction to fine-tuning
    Church, Kenneth Ward
    Chen, Zeyu
    Ma, Yanjun
    NATURAL LANGUAGE ENGINEERING, 2021, 27 (06) : 763 - 778
  • [4] Transfer fine-tuning of BERT with phrasal paraphrases
    Arase, Yuki
    Tsujii, Junichi
    COMPUTER SPEECH AND LANGUAGE, 2021, 66
  • [5] SPEECH RECOGNITION BY SIMPLY FINE-TUNING BERT
    Huang, Wen-Chin
    Wu, Chia-Hua
    Luo, Shang-Bao
    Chen, Kuan-Yu
    Wang, Hsin-Min
    Toda, Tomoki
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7343 - 7347
  • [6] Efficient Fine-Tuning of BERT Models on the Edge
    Vucetic, Danilo
    Tayaranian, Mohammadreza
    Ziaeefard, Maryam
    Clark, James J.
    Meyer, Brett H.
    Gross, Warren J.
    2022 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS 22), 2022, : 1838 - 1842
  • [7] A Pairwise Probe for Understanding BERT Fine-Tuning on Machine Reading Comprehension
    Cai, Jie
    Zhu, Zhengzhou
    Nie, Ping
    Liu, Qian
    PROCEEDINGS OF THE 43RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '20), 2020, : 1665 - 1668
  • [8] Improve Performance of Fine-tuning Language Models with Prompting
    Yang, Zijian Gyozo
    Ligeti-Nagy, Noenn
    INFOCOMMUNICATIONS JOURNAL, 2023, 15 : 62 - 68
  • [9] Fine-tuning language models to recognize semantic relations
    Roussinov, Dmitri
    Sharoff, Serge
    Puchnina, Nadezhda
    LANGUAGE RESOURCES AND EVALUATION, 2023, 57 (04) : 1463 - 1486