Embers of autoregression show how large language models are shaped by the problem they are trained to solve

Cited by: 6
Authors
McCoy, R. Thomas [1 ,3 ,4 ]
Yao, Shunyu [1 ,5 ]
Friedman, Dan [1 ]
Hardy, Mathew D. [2 ]
Griffiths, Thomas L. [1 ,2 ]
Affiliations
[1] Princeton Univ, Dept Comp Sci, Princeton, NJ 08542 USA
[2] Princeton Univ, Dept Psychol, Princeton, NJ 08542 USA
[3] Yale Univ, Dept Linguist, New Haven, CT 06520 USA
[4] Yale Univ, Wu Tsai Inst, New Haven, CT 06520 USA
[5] OpenAI, San Francisco, CA 94110 USA
Keywords
cognitive science; artificial intelligence; large language models;
DOI
10.1073/pnas.2322420121
Chinese Library Classification (CLC): O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline codes: 07; 0710; 09
Abstract
The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that to develop a holistic understanding of these systems, we must consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts, we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. Using this approach, which we call the teleological approach, we identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. To test our predictions, we evaluate five LLMs (GPT-3.5, GPT-4, Claude 3, Llama 3, and Gemini 1.0) on 11 tasks, and we find robust evidence that LLMs are influenced by probability in the hypothesized ways. Many of the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability sentence but only 13% when it is low-probability, even though this task is a deterministic one for which probability should not matter. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system: one that has been shaped by its own particular set of pressures.
Pages: 12
Related Papers
50 records in total
  • [41] Wei, Zizhong; Zhang, Qilai; Duan, Qiang; Wang, Guangxin; Li, Rui; Li, Xue; Chen, Qibin; Yang, Tong; Zhang, Lei; Jiang, Kai. TAR: A Think-Action-Reflection Framework for Complex Problem Solving with Large Language Models. Proceedings of International Conference on Modeling, Natural Language Processing and Machine Learning (CMNM 2024), 2024: 282-286.
  • [42] Wu, Stephen; Otake, Yu; Mizutani, Daijiro; Liu, Chang; Asano, Kotaro; Sato, Nana; Saito, Taiga; Baba, Hidetoshi; Fukunaga, Yusuke; Higo, Yosuke; Kamura, Akiyoshi; Kodama, Shinnosuke; Metoki, Masataka; Nakamura, Tomoka; Nakazato, Yuto; Shioi, Akihiro; Takenobu, Masahiro; Tsukioka, Keigo; Yoshikawa, Ryo. Future-proofing geotechnics workflows: accelerating problem-solving with large language models. Georisk: Assessment and Management of Risk for Engineered Systems and Geohazards, 2024.
  • [43] Orru, Graziella; Piarulli, Andrea; Conversano, Ciro; Gemignani, Angelo. Human-like problem-solving abilities in large language models using ChatGPT. Frontiers in Artificial Intelligence, 2023, 6.
  • [44] Suri, Gaurav; Slater, Lily R.; Ziaee, Ali; Nguyen, Morgan. Do Large Language Models Show Decision Heuristics Similar to Humans? A Case Study Using GPT-3.5. Journal of Experimental Psychology: General, 2024, 153(04): 1066-1075.
  • [45] Xue, Lina. Urgent, but How? Developing English Foreign Language Teachers' Digital Literacy in a Professional Learning Community Focusing on Large Language Models. European Journal of Education, 2025, 60(01).
  • [46] Rotaru, George-Cristinel; Anagnoste, Sorin; Oancea, Vasile-Marian. How Artificial Intelligence Can Influence Elections: Analyzing the Large Language Models (LLMs) Political Bias. Proceedings of the International Conference on Business Excellence, 2024, 18(01): 1882-1891.
  • [47] Mathis, Walter S.; Zhao, Sophia; Pratt, Nicholas; Weleff, Jeremy; De Paoli, Stefano. Inductive thematic analysis of healthcare qualitative interviews using open-source large language models: How does it compare to traditional methods? Computer Methods and Programs in Biomedicine, 2024, 255.
  • [48] Wang, Zhi Qiang; Wang, Haopeng; El Saddik, Abdulmotaleb. FedITD: A Federated Parameter-Efficient Tuning With Pre-Trained Large Language Models and Transfer Learning Framework for Insider Threat Detection. IEEE Access, 2024, 12: 160396-160417.
  • [49] Mhiri, Ibrahim; Boersig, Matthias; Stark, Akim; Baumgart, Ingmar. How to Train Your Llama - Efficient Grammar-Based Application Fuzzing Using Large Language Models. Secure IT Systems (NordSec 2024), 2025, 15396: 239-257.
  • [50] Wang, Ming; Ma, Ruiyang; Shen, Geoffrey Qiping; Xue, Jin. How Large Language Models Empower the Analysis of Online Public Engagement for Mega Infrastructure Projects: Cases in Hong Kong. IEEE Transactions on Engineering Management, 2025, 72: 1262-1280.