Automated Scoring of Creative Problem Solving With Large Language Models: A Comparison of Originality and Quality Ratings

Cited by: 5

Authors:

Luchini, Simone A. [1]

Maliakkal, Nadine T. [2]

Distefano, Paul V. [1]

Laverghetta Jr, Antonio [1]

Patterson, John D. [1]

Beaty, Roger E. [1]

Reiter-Palmon, Roni [2]

Affiliations:

[1] Penn State Univ, Dept Psychol, 140 Moore Bldg, University Pk, PA 16802 USA

[2] Univ Nebraska, Dept Psychol, Omaha, NE USA

Source:

PSYCHOLOGY OF AESTHETICS CREATIVITY AND THE ARTS | 2025

Funding:

National Science Foundation (United States);

Keywords:

automated scoring; creativity; creative problem-solving; large language models; naturalistic creativity assessment; DIVERGENT THINKING TESTS; PROBLEM CONSTRUCTION; SELF-PERCEPTIONS; EXPERTISE;

DOI:

10.1037/aca0000736

Chinese Library Classification number:

C [Social Sciences, General];

Discipline classification code:

03 ; 0303 ;

Abstract:

Creative problem solving is a naturalistic form of creative thinking involving the generation of solutions that are not only original but also of high quality (i.e., plausible and effective). Past work has shown that large language models (LLMs) can predict human originality ratings of responses to creativity tests. We extend this work to creative problem solving, examining whether both originality and quality can be automatically scored for a naturalistic creativity task. We gathered data from 10 studies, amounting to 3,243 participants who completed different items of the creative problem-solving task (CPST). We then fine-tuned two open-source LLMs, RoBERTa and GPT-2, and few-shot prompted two separate LLMs, Claude and Llama, to predict human ratings of originality and quality on the CPST. We compared LLM performance to two other scoring methods: elaboration and semantic distance. We found that RoBERTa and GPT-2 models predict human ratings of solution quality (RoBERTa, r = .83; GPT-2, r = .83) better than solution originality (RoBERTa, r = .79; GPT-2, r = .80). Moreover, we found that both models outperformed elaboration and semantic distance and generalized to new CPST items not in their training set, with stronger predictions for quality than originality on the holdout-prompt set. Few-shot prompting was less effective than fine-tuning at predicting both originality (r = .66-.11) and quality (r = .62-.26). We show for the first time that naturalistic creativity tasks can be automatically scored for both originality and quality. Open access is provided to the models and training data.
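The r values in the abstract report how closely automated scores track human ratings. A minimal sketch of that kind of comparison follows, assuming r denotes a Pearson correlation. The function name pearson_r and all numbers below are made up for illustration and are not taken from the study.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each sequence
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings for five responses (illustrative values only)
human_ratings = [1.0, 2.5, 3.0, 4.0, 4.5]
model_scores = [1.2, 2.0, 3.3, 3.8, 4.9]

r = pearson_r(human_ratings, model_scores)
print(f"r = {r:.2f}")
```

A value near 1 means the automated scores rank and space the responses much as the human raters did; a value near 0 means little linear agreement.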

Pages: 15

References: 90 in total

