Embers of autoregression show how large language models are shaped by the problem they are trained to solve

Citations: 6
|
Authors
McCoy, R. Thomas [1 ,3 ,4 ]
Yao, Shunyu [1 ,5 ]
Friedman, Dan [1 ]
Hardy, Mathew D. [2 ]
Griffiths, Thomas L. [1 ,2 ]
Affiliations
[1] Princeton Univ, Dept Comp Sci, Princeton, NJ 08542 USA
[2] Princeton Univ, Dept Psychol, Princeton, NJ 08542 USA
[3] Yale Univ, Dept Linguist, New Haven, CT 06520 USA
[4] Yale Univ, Wu Tsai Inst, New Haven, CT 06520 USA
[5] OpenAI, San Francisco, CA 94110 USA
Keywords
cognitive science; artificial intelligence; large language models;
DOI
10.1073/pnas.2322420121
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy, Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject Classification Codes
07 ; 0710 ; 09 ;
Abstract
The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that to develop a holistic understanding of these systems, we must consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts, we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. Using this approach, which we call the teleological approach, we identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. To test our predictions, we evaluate five LLMs (GPT-3.5, GPT-4, Claude 3, Llama 3, and Gemini 1.0) on 11 tasks, and we find robust evidence that LLMs are influenced by probability in the hypothesized ways. Many of the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability sentence but only 13% when it is low-probability, even though this task is a deterministic one for which probability should not matter. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system, one that has been shaped by its own particular set of pressures.
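The cipher experiment in the abstract can be made concrete with a minimal sketch. The abstract only says "a simple cipher"; the rot13 shift cipher and the sample sentences below are assumptions for illustration. The point the sketch demonstrates is that decoding is fully deterministic, so in principle accuracy should not depend on how probable the plaintext is:

```python
import codecs

def rot13(text: str) -> str:
    """Shift each letter 13 places; applying rot13 twice recovers the original."""
    return codecs.encode(text, "rot13")

# Hypothetical high- and low-probability target sentences: the second is
# the same words in a scrambled, unlikely order.
high_prob = "the cat sat on the mat"
low_prob = "mat the on sat cat the"

# Decoding either ciphertext is the same mechanical operation, yet the
# paper reports GPT-4 accuracy of 51% vs. 13% on such pairs.
assert rot13(rot13(high_prob)) == high_prob
assert rot13(rot13(low_prob)) == low_prob
```

Because the mapping is a fixed bijection on letters, any difference in LLM accuracy between the two cases reflects the model's sensitivity to output probability, not task difficulty.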
Pages: 12