Theory of mind performance of large language models: A comparative analysis of Turkish and English

Cited: 3
Authors
Unlutabak, Burcu [1 ]
Bal, Onur [1 ]
Affiliations
[1] Yeditepe Univ, Dept Psychol, 26 Agustos Yerleskesi,Atasehir Inonu Mah Kayisdagi, Istanbul, Turkiye
Keywords
Theory of mind; Large language models; First-order false belief; Second-order false belief; False beliefs; Japanese; Infant
DOI
10.1016/j.csl.2024.101698
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Theory of mind (ToM), the ability to understand others' mental states, is a defining human skill. Research assessing LLMs' ToM performance yields conflicting findings and has prompted debate about whether and how these models could show ToM understanding. Psychological research indicates that the characteristics of a specific language can influence how mental states are represented and communicated. It is therefore reasonable to expect language characteristics to influence how LLMs communicate with humans, especially when the conversation involves references to mental states. This study examines how these characteristics affect LLMs' ToM performance by evaluating GPT-3.5 and GPT-4 in English and Turkish. Turkish provides an excellent contrast to English, since Turkish has a different syntactic structure and the special verbs san- and zannet-, meaning "falsely believe." Using OpenAI's Chat Completions API, we collected responses from GPT models for first- and second-order ToM scenarios in English and Turkish. Our approach combined completion prompts and open-ended questions within the same chat session, offering deep insights into the models' reasoning processes. Our data showed that while GPT models can respond accurately to standard ToM tasks (100% accuracy), their performance deteriorates to below chance level with slight modifications. This high sensitivity suggests a lack of robustness in ToM performance. GPT-4 outperformed its predecessor, GPT-3.5, showing some improvement in ToM performance. The models generally performed better when tasks were presented in English than in Turkish. These findings indicate that GPT models cannot yet reliably pass first- and second-order ToM tasks in either language. The findings have significant implications for the explainability of LLMs by highlighting the challenges and biases they face when simulating human-like ToM understanding in different languages.
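The abstract describes pairing a completion prompt with an open-ended follow-up question inside a single chat session, in both English and Turkish. A minimal sketch of that setup is shown below, assuming the OpenAI Chat Completions message format; the scenario wording, function names, and the `SCENARIOS` structure are illustrative, not the authors' actual stimuli or code.

```python
# Sketch of one first-order false-belief session in two languages.
# The story text below is an illustrative Sally-Anne-style scenario,
# not the study's actual materials.
SCENARIOS = {
    "en": {
        "story": ("Ali puts his chocolate in the drawer and leaves the room. "
                  "While he is away, his mother moves the chocolate to the cupboard."),
        "completion": "When Ali returns, he will look for the chocolate in the",
        "open_ended": "Where will Ali look for the chocolate first, and why?",
    },
    "tr": {
        "story": ("Ali çikolatasını çekmeceye koyar ve odadan çıkar. "
                  "O yokken annesi çikolatayı dolaba taşır."),
        "completion": "Ali geri döndüğünde çikolatayı şurada arayacak:",
        "open_ended": "Ali çikolatayı ilk nerede arayacak ve neden?",
    },
}

def build_messages(lang: str) -> list:
    """First turn of the session: the story plus a completion prompt."""
    s = SCENARIOS[lang]
    return [{"role": "user", "content": f"{s['story']}\n{s['completion']}"}]

def follow_up(messages: list, model_reply: str, lang: str) -> list:
    """Extend the same session with the model's reply and the
    open-ended question, so both probes share one chat context."""
    s = SCENARIOS[lang]
    return messages + [
        {"role": "assistant", "content": model_reply},
        {"role": "user", "content": s["open_ended"]},
    ]

# Sending a session would look roughly like this (requires OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# first = client.chat.completions.create(model="gpt-4",
#                                        messages=build_messages("en"))
```

Keeping both probes in one session means the open-ended answer can be scored against the model's own completion, which is what makes the reasoning process inspectable.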
Pages: 22