Embers of autoregression show how large language models are shaped by the problem they are trained to solve

Cited by: 6
Authors
McCoy, R. Thomas [1 ,3 ,4 ]
Yao, Shunyu [1 ,5 ]
Friedman, Dan [1 ]
Hardy, Mathew D. [2 ]
Griffiths, Thomas L. [1 ,2 ]
Affiliations
[1] Princeton Univ, Dept Comp Sci, Princeton, NJ 08542 USA
[2] Princeton Univ, Dept Psychol, Princeton, NJ 08542 USA
[3] Yale Univ, Dept Linguist, New Haven, CT 06520 USA
[4] Yale Univ, Wu Tsai Inst, New Haven, CT 06520 USA
[5] OpenAI, San Francisco, CA 94110 USA
Keywords
cognitive science; artificial intelligence; large language models;
DOI
10.1073/pnas.2322420121
CLC Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject Classification Codes
07 ; 0710 ; 09 ;
Abstract
The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that to develop a holistic understanding of these systems, we must consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts, we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. Using this approach, which we call the teleological approach, we identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. To test our predictions, we evaluate five LLMs (GPT-3.5, GPT-4, Claude 3, Llama 3, and Gemini 1.0) on 11 tasks, and we find robust evidence that LLMs are influenced by probability in the hypothesized ways. Many of the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability sentence but only 13% when it is low-probability, even though this task is a deterministic one for which probability should not matter. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system, one that has been shaped by its own particular set of pressures.
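The cipher-decoding result above hinges on the task being fully deterministic: the mapping from ciphertext to plaintext is fixed, so in principle output probability should be irrelevant. The sketch below illustrates the kind of task involved, assuming a rot13-style shift cipher; the specific sentences and prompt framing are illustrative assumptions, not taken from the paper's materials:

```python
import codecs

def rot13(text: str) -> str:
    # Shift each letter 13 places in the alphabet; because
    # 13 + 13 = 26, rot13 is its own inverse, so the same
    # function both encodes and decodes.
    return codecs.encode(text, "rot_13")

# A high-probability sentence vs. a scrambled, low-probability one.
high_prob = "The quick brown fox jumps over the lazy dog."
low_prob = "Fox the lazy brown over quick jumps dog the."

for sentence in (high_prob, low_prob):
    encoded = rot13(sentence)
    # Decoding is a deterministic round trip for both sentences,
    # so any accuracy gap an LLM shows between them reflects the
    # probability of the target output, not task difficulty.
    assert rot13(encoded) == sentence
```

An evaluation in this style would present the encoded string to the model, ask for the decoded sentence, and compare model accuracy across high- and low-probability targets.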
Pages: 12
Related Papers
50 records in total
  • [1] How to Use Large Language Models for Empirical Legal Research
    Choi, Jonathan H.
    JOURNAL OF INSTITUTIONAL AND THEORETICAL ECONOMICS-ZEITSCHRIFT FUR DIE GESAMTE STAATSWISSENSCHAFT, 2024, 180 (02): : 214 - 233
  • [2] Assessing Phrase Break of ESL Speech with Pre-trained Language Models and Large Language Models
    Wang, Zhiyi
    Mao, Shaoguang
    Wu, Wenshan
    Xia, Yan
    Deng, Yan
    Tien, Jonathan
    INTERSPEECH 2023, 2023, : 4194 - 4198
  • [3] Adopting Pre-trained Large Language Models for Regional Language Tasks: A Case Study
    Gaikwad, Harsha
    Kiwelekar, Arvind
    Laddha, Manjushree
    Shahare, Shashank
    INTELLIGENT HUMAN COMPUTER INTERACTION, IHCI 2023, PT I, 2024, 14531 : 15 - 25
  • [4] A Study on the Representativeness Heuristics Problem in Large Language Models
    Ryu, Jongwon
    Kim, Jungeun
    Kim, Junyeong
    IEEE ACCESS, 2024, 12 : 147958 - 147966
  • [5] Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey
    Min, Bonan
    Ross, Hayley
    Sulem, Elior
    Ben Veyseh, Amir Pouran
    Nguyen, Thien Huu
    Sainz, Oscar
    Agirre, Eneko
    Heintz, Ilana
    Roth, Dan
    ACM COMPUTING SURVEYS, 2024, 56 (02)
  • [6] How secure is AI-generated code: a large-scale comparison of large language models
    Tihanyi, Norbert
    Bisztray, Tamas
    Ferrag, Mohamed Amine
    Jain, Ridhi
    Cordeiro, Lucas C.
    EMPIRICAL SOFTWARE ENGINEERING, 2025, 30 (02)
  • [7] Looking Through the Deep Glasses: How Large Language Models Enhance Explainability of Deep Learning Models
    Spitzer, Philipp
    Celis, Sebastian
    Martin, Dominik
    Kuehl, Niklas
    Satzger, Gerhard
    PROCEEDINGS OF THE 2024 CONFERENCE ON MENSCH UND COMPUTER, MUC 2024, 2024, : 566 - 570
  • [8] On the potential of large language models to solve semantics-aware process mining tasks
    Rebmann, Adrian
    Schmidt, Fabian David
    Glavaš, Goran
    van der Aa, Han
    PROCESS SCIENCE, 2 (1):
  • [9] Empowering patients: how accurate and readable are large language models in renal cancer education
    Halawani, Abdulghafour
    Almehmadi, Sultan G.
    Alhubaishy, Bandar A.
    Alnefaie, Ziyad A.
    Hasan, Mudhar N.
    FRONTIERS IN ONCOLOGY, 2024, 14
  • [10] How to Optimize Prompting for Large Language Models in Clinical Research
    Lee, Jeong Hyun
    Shin, Jaeseung
    KOREAN JOURNAL OF RADIOLOGY, 2024, 25 (10) : 869 - 873