Do Multimodal Large Language Models and Humans Ground Language Similarly?

Cited by: 0
Authors
Jones, Cameron R. [1 ]
Bergen, Benjamin [1 ]
Trott, Sean [1 ]
Affiliations
[1] Univ Calif San Diego, Dept Cognit Sci, San Diego, CA 92093 USA
Keywords
REPRESENTATION; ORIENTATION; EMBODIMENT; MOTOR
DOI
10.1162/coli_a_00531
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world: for failing to solve the "symbol grounding problem." Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how, and to what degree, MLLMs integrate their distinct modalities, and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through "embodied simulation," the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM's lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture, despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.
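The sentence-picture matching paradigm underlying these experiments can be sketched in miniature. In the sketch below, the embeddings are toy, hand-picked vectors rather than the learned representations of an actual MLLM, and the sentence and images are hypothetical examples; it only illustrates the logic of a CLIP-style dual-encoder score, where a model "sensitive" to an implicit sensorimotor feature (here, orientation) assigns a higher match to the congruent image:

```python
import math

def match_score(text_emb, image_emb):
    # Dual-encoder (CLIP-style) match: cosine similarity between
    # independently encoded sentence and image representations.
    dot = sum(t * i for t, i in zip(text_emb, image_emb))
    norm = (math.sqrt(sum(t * t for t in text_emb))
            * math.sqrt(sum(i * i for i in image_emb)))
    return dot / norm

# Toy vectors standing in for embeddings of a sentence like
# "He hammered the nail into the wall" and two pictures of a nail.
sentence        = [1.0, 0.2, 0.0, 0.5]
nail_horizontal = [0.9, 0.3, 0.1, 0.4]  # orientation the sentence implies
nail_vertical   = [0.0, 1.0, 0.8, 0.0]  # orientation it does not imply

# Sensitivity to the implicit feature = congruent image scores higher.
print(match_score(sentence, nail_horizontal)
      > match_score(sentence, nail_vertical))
# → True
```

A single-stream architecture such as ViLT instead processes text and image tokens jointly in one transformer, so no such factored similarity exists; the paper's comparison asks which style of integration better predicts human responses.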
Pages: 1415-1440
Page count: 26