Fine-tuning ChatGPT for automatic scoring

Cited by: 0
Authors
Latif E. [1 ]
Zhai X. [1 ]
Affiliations
[1] AI4STEM Education Center, University of Georgia, Athens, GA
Source
Computers and Education: Artificial Intelligence | 2024 / Vol. 6
Funding
U.S. National Science Foundation;
Keywords
Automatic scoring; BERT; Education; Finetune; GPT-3.5; Large language model (LLM);
DOI
10.1016/j.caeai.2024.100210
Abstract
This study highlights the potential of fine-tuned ChatGPT (GPT-3.5) for automatically scoring student-written constructed responses, using example assessment tasks in science education. The application of ChatGPT in research and academic fields has greatly enhanced productivity and efficiency. Recent studies on ChatGPT, based on OpenAI's generative model GPT-3.5, demonstrated its ability to predict natural language with high accuracy and produce human-like responses. GPT-3.5 has been trained on enormous online language materials such as journals and Wikipedia; however, directly using pre-trained GPT-3.5 is insufficient for automatic scoring, because students do not write in the same register as journals or Wikipedia, and contextual information is required for accurate scoring. This implies that fine-tuning a domain-specific model on task-specific data can enhance performance. In this study, we fine-tuned GPT-3.5 on six assessment tasks with a diverse dataset of middle-school and high-school student responses and expert scores. The six tasks comprise two multi-label and four multi-class assessment tasks. We compare the performance of fine-tuned GPT-3.5 with that of the fine-tuned state-of-the-art language model from Google, BERT. The results show that BERT, trained on in-domain corpora constructed from science questions and responses, achieved an average accuracy of 0.838 (SD = 0.069). GPT-3.5 shows a remarkable average increase of 9.1% in automatic scoring accuracy (mean = 0.915, SD = 0.042) across the six tasks, p = 0.001 < 0.05. Specifically, for each of the two multi-label tasks (item 1 with 5 labels; item 2 with 10 labels), GPT-3.5 achieved significantly higher scoring accuracy than BERT across all labels, with the second item showing a 7.1% increase. The average scoring increase for the four multi-class items was 10.6% for GPT-3.5 compared with BERT. Our study confirms the effectiveness of fine-tuned GPT-3.5 for highly accurate automatic scoring of student responses on domain-specific educational data. We have released the fine-tuned models for public use and community engagement. © 2024 The Author(s)
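
The abstract summarizes the fine-tuning workflow but does not spell out its implementation; as a minimal sketch, assuming the standard OpenAI chat fine-tuning API (Python SDK v1), preparing expert-scored responses and training a scoring model might look like the following. The item text, rubric label, and file name are hypothetical placeholders, not the authors' released prompts, data, or models.

# Illustrative sketch only: fine-tuning GPT-3.5 for rubric-based scoring via the
# OpenAI fine-tuning API. Prompts, labels, and file names are hypothetical.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Build a chat-format JSONL file: one expert-scored student response per line.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a science-assessment scorer. "
                                          "Return only the rubric label."},
            {"role": "user", "content": "Item: Explain why ice melts faster on a metal plate.\n"
                                        "Response: Metal moves heat to the ice quicker than wood."},
            {"role": "assistant", "content": "Proficient"},  # expert-assigned score
        ]
    },
    # ... one entry per scored response in the training split
]
with open("scoring_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# 2) Upload the file and start a fine-tuning job on GPT-3.5.
train_file = client.files.create(file=open("scoring_train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=train_file.id, model="gpt-3.5-turbo")

# 3) After the job finishes, score a new response with the fine-tuned model.
job = client.fine_tuning.jobs.retrieve(job.id)  # in practice, poll until status == "succeeded"
if job.status == "succeeded":
    reply = client.chat.completions.create(
        model=job.fine_tuned_model,
        messages=[
            {"role": "system", "content": "You are a science-assessment scorer. "
                                          "Return only the rubric label."},
            {"role": "user", "content": "Item: Explain why ice melts faster on a metal plate.\n"
                                        "Response: Because metal is colder."},
        ],
    )
    print(reply.choices[0].message.content)  # predicted rubric label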
Related papers
Total: 50 records
  • [1] Fine-Tuning ChatGPT for Automatic Scoring of Written Scientific Explanations in Chinese
    Yang, Jie
    Latif, Ehsan
    He, Yuze
    Zhai, Xiaoming
    JOURNAL OF SCIENCE EDUCATION AND TECHNOLOGY, 2025,
  • [2] Two-stage fine-tuning with ChatGPT data augmentation for learning class-imbalanced data
    Valizadehaslani, Taha
    Shi, Yiwen
    Wang, Jing
    Ren, Ping
    Zhang, Yi
    Hu, Meng
    Zhao, Liang
    Liang, Hualou
    NEUROCOMPUTING, 2024, 592
  • [3] Emerging trends: A gentle introduction to fine-tuning
    Church, Kenneth Ward
    Chen, Zeyu
    Ma, Yanjun
    NATURAL LANGUAGE ENGINEERING, 2021, 27 (06) : 763 - 778
  • [4] Transfer fine-tuning of BERT with phrasal paraphrases
    Arase, Yuki
    Tsujii, Junichi
    COMPUTER SPEECH AND LANGUAGE, 2021, 66
  • [5] SPEECH RECOGNITION BY SIMPLY FINE-TUNING BERT
    Huang, Wen-Chin
    Wu, Chia-Hua
    Luo, Shang-Bao
    Chen, Kuan-Yu
    Wang, Hsin-Min
    Toda, Tomoki
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7343 - 7347
  • [6] Efficient Fine-Tuning of BERT Models on the Edge
    Vucetic, Danilo
    Tayaranian, Mohammadreza
    Ziaeefard, Maryam
    Clark, James J.
    Meyer, Brett H.
    Gross, Warren J.
    2022 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS 22), 2022, : 1838 - 1842
  • [7] A Pairwise Probe for Understanding BERT Fine-Tuning on Machine Reading Comprehension
    Cai, Jie
    Zhu, Zhengzhou
    Nie, Ping
    Liu, Qian
    PROCEEDINGS OF THE 43RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '20), 2020, : 1665 - 1668
  • [8] Improve Performance of Fine-tuning Language Models with Prompting
    Yang, Zijian Gyozo
    Ligeti-Nagy, Noenn
    INFOCOMMUNICATIONS JOURNAL, 2023, 15 : 62 - 68
  • [9] Fine-tuning language models to recognize semantic relations
    Roussinov, Dmitri
    Sharoff, Serge
    Puchnina, Nadezhda
    LANGUAGE RESOURCES AND EVALUATION, 2023, 57 (04) : 1463 - 1486