Embers of autoregression show how large language models are shaped by the problem they are trained to solve

Cited: 6
Authors
McCoy, R. Thomas [1 ,3 ,4 ]
Yao, Shunyu [1 ,5 ]
Friedman, Dan [1 ]
Hardy, Mathew D. [2 ]
Griffiths, Thomas L. [1 ,2 ]
Affiliations
[1] Princeton Univ, Dept Comp Sci, Princeton, NJ 08542 USA
[2] Princeton Univ, Dept Psychol, Princeton, NJ 08542 USA
[3] Yale Univ, Dept Linguist, New Haven, CT 06520 USA
[4] Yale Univ, Wu Tsai Inst, New Haven, CT 06520 USA
[5] OpenAI, San Francisco, CA 94110 USA
Keywords
cognitive science; artificial intelligence; large language models
DOI
10.1073/pnas.2322420121
CLC Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy, Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject Classification
07; 0710; 09
Abstract
The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that to develop a holistic understanding of these systems, we must consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts, we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. Using this approach, which we call the teleological approach, we identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. To test our predictions, we evaluate five LLMs (GPT-3.5, GPT-4, Claude 3, Llama 3, and Gemini 1.0) on 11 tasks, and we find robust evidence that LLMs are influenced by probability in the hypothesized ways. Many of the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability sentence but only 13% when it is low-probability, even though this task is a deterministic one for which probability should not matter. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system: one that has been shaped by its own particular set of pressures.
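The cipher task the abstract refers to is deterministic: the same rule maps each ciphertext to exactly one plaintext, so an ideal decoder's accuracy would not depend on how probable the answer sentence is. As a minimal sketch (assuming a rot13-style shift cipher, a common choice for such evaluations; the abstract does not name the specific cipher), the ground-truth mapping can be generated like this:

```python
import codecs

def rot13(text: str) -> str:
    """Shift each letter 13 places in the alphabet.

    rot13 is self-inverse: applying it twice returns the original text,
    so encoding and decoding use the same function.
    """
    return codecs.encode(text, "rot_13")

# A high-probability target sentence and an encoded prompt for it.
plaintext = "The quick brown fox jumps over the lazy dog."
ciphertext = rot13(plaintext)

# Deterministic check: decoding recovers the plaintext exactly,
# regardless of how probable the sentence is under a language model.
assert rot13(ciphertext) == plaintext
```

Because the mapping is exact, any gap in model accuracy between high- and low-probability target sentences (51% vs. 13% for GPT-4 in the abstract) reflects the model's probabilistic training pressures rather than task difficulty.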
Pages: 12
Related papers (50 in total)
  • [31] Large Language Models "ad referendum": How Good Are They at Machine Translation in the Legal Domain?
    Briva-Iglesias, Vicent
    Camargo, Joao Lucas Cavalheiro
    Dogru, Gokhan
    MONTI, 2024, 16 : 75 - 107
  • [32] CloChat: Understanding How People Customize, Interact, and Experience Personas in Large Language Models
    Ha, Juhye
    Jeon, Hyeon
    Han, Daeun
    Seo, Jinwook
    Oh, Changhoon
    PROCEEDINGS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS (CHI 2024), 2024,
  • [33] Variability in Large Language Models' Responses to Medical Licensing and Certification Examinations. Comment on "How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment"
    Epstein, Richard H.
    Dexter, Franklin
    JMIR MEDICAL EDUCATION, 2023, 9
  • [34] Parameter-Efficient Fine-Tuning of Pre-trained Large Language Models for Financial Text Analysis
    Langa, Kelly
    Wang, Hairong
    Okuboyejo, Olaperi
    ARTIFICIAL INTELLIGENCE RESEARCH, SACAIR 2024, 2025, 2326 : 3 - 20
  • [35] How understanding large language models can inform the use of ChatGPT in physics education
    Polverini, Giulia
    Gregorcic, Bor
    EUROPEAN JOURNAL OF PHYSICS, 2024, 45 (02)
  • [36] On Using Large Language Models Pre-trained on Digital Twins as Oracles to Foster the Use of Formal Methods in Practice
    Autexier, Serge
    LEVERAGING APPLICATIONS OF FORMAL METHODS, VERIFICATION AND VALIDATION: SOFTWARE ENGINEERING METHODOLOGIES, PT IV, ISOLA 2024, 2025, 15222 : 30 - 43
  • [37] Unlocking language barriers: Assessing pre-trained large language models across multilingual tasks and unveiling the black box with Explainable Artificial Intelligence
    Kastrati, Muhamet
    Imran, Ali Shariq
    Hashmi, Ehtesham
    Kastrati, Zenun
    Daudpota, Sher Muhammad
    Biba, Marenglen
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2025, 149
  • [38] Large Language Models Based Vulnerability Detection: How Does It Enhance Performance?
    Cho Do Xuan
    Dat Bui Quang
    Vinh Dang Quang
    International Journal of Information Security, 2025, 24 (1)
  • [39] Automated Scoring of Creative Problem Solving With Large Language Models: A Comparison of Originality and Quality Ratings
    Luchini, Simone A.
    Maliakkal, Nadine T.
    Distefano, Paul V.
    Laverghetta Jr, Antonio
    Patterson, John D.
    Beaty, Roger E.
    Reiter-Palmon, Roni
    PSYCHOLOGY OF AESTHETICS CREATIVITY AND THE ARTS, 2025,
  • [40] Using Large Language Models to Support Teaching and Learning of Word Problem Solving in Tutoring Systems
    Arnau-Blasco, Jaime
    Arevalillo-Herraez, Miguel
    Solera-Monforte, Sergi
    Wu, Yuyan
    GENERATIVE INTELLIGENCE AND INTELLIGENT TUTORING SYSTEMS, PT I, ITS 2024, 2024, 14798 : 3 - 13