Prompt Wrangling: On Replication and Generalization in Large Language Models for PCG Levels

Cited: 0
Authors
Karkaj, Arash Moradi [1 ]
Nelson, Mark J. [2 ]
Koutis, Ioannis [1 ]
Hoover, Amy K. [1 ]
Affiliations
[1] New Jersey Inst Technol, Newark, NJ 07102 USA
[2] Amer Univ, Washington, DC 20016 USA
Source
PROCEEDINGS OF THE 19TH INTERNATIONAL CONFERENCE ON THE FOUNDATIONS OF DIGITAL GAMES, FDG 2024 | 2024
Keywords
Procedural content generation (PCG); Large Language Models (LLMs); Generalizability; Evaluating Generalization; Science Birds;
DOI
10.1145/3649921.3659853
CLC Classification
TP3 [Computing technology and computer technology];
Discipline Classification Code
0812;
Abstract
The ChatGPT4PCG competition calls for participants to submit prompts that guide ChatGPT's output toward instructions for generating levels as sequences of Tetris-like block drops. Submitted prompts are queried against ChatGPT to generate levels that resemble letters of the English alphabet, and the resulting levels are evaluated on their similarity to the target letter and their physical stability in the game engine. This provides a quantitative evaluation setting for prompt-based procedural content generation (PCG), an approach that has been gaining popularity in PCG, as in other areas of generative AI. This paper focuses on replicating and generalizing the competition results. Our replication experiments first test whether the number of responses gathered from ChatGPT is sufficient to account for the stochasticity of its output. We then requery the original prompt submissions and rerun the original competition scripts on our own machines, roughly six months after the competition, with varying sample sizes. We find that the results largely replicate, except that two of the 15 submissions perform much better in our replication, for reasons we can only partly determine. Regarding generalization, we observe that the top-performing prompt hardcodes instructions for all 26 target levels, which is at odds with the PCGML goal of generating new, previously unseen content from examples. We therefore perform experiments in a more restricted few-shot prompting scenario and find that generalization remains a challenge for current approaches.
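For concreteness, the sketch below illustrates one way the repeated querying described in the abstract could be set up: the same prompt, asking for a level shaped like a target letter, is sent to ChatGPT several times and the raw responses are collected for later scoring. This is a minimal sketch only, assuming the OpenAI Python client (openai>=1.0) and hypothetical names such as query_prompt, PROMPT_TEMPLATE, and N_SAMPLES; it does not reproduce the competition's actual prompts, scripts, or evaluation pipeline.

```python
# Minimal sketch (not the competition scripts): repeatedly query the same
# prompt so that multiple samples capture the stochasticity of ChatGPT's output.
# Assumes the OpenAI Python client (openai>=1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

N_SAMPLES = 10            # hypothetical sample size; the paper varies this
TARGET_LETTER = "A"       # one of the 26 target letters
PROMPT_TEMPLATE = (       # hypothetical prompt text, not a competition submission
    "Generate a Science Birds level shaped like the letter {letter} "
    "as a sequence of block-drop instructions."
)

def query_prompt(letter: str, n: int) -> list[str]:
    """Send the same prompt n times and return the raw text responses."""
    prompt = PROMPT_TEMPLATE.format(letter=letter)
    responses = []
    for _ in range(n):
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        responses.append(completion.choices[0].message.content)
    return responses

if __name__ == "__main__":
    samples = query_prompt(TARGET_LETTER, N_SAMPLES)
    # Each sample would then be parsed into block drops and scored for letter
    # similarity and physical stability by the competition's own tooling.
    print(f"Collected {len(samples)} responses for letter {TARGET_LETTER}")
```

The sample size n is the lever the replication experiments vary; everything downstream of collection (parsing the drops, simulating them in Science Birds, and scoring similarity and stability) is left to the original competition scripts.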
Pages: 8