Evaluating alignment in large language models: a review of methodologies

被引:0
|
作者
Uma E. Sarkar [1 ]
机构
[1] Texas A&M University,
来源
AI and Ethics | 2025年 / 5卷 / 3期
关键词
AI alignment; Large language models; Adversarial testing; Constitutional AI; AI safety evaluation;
D O I
10.1007/s43681-024-00637-w
中图分类号
学科分类号
摘要
As artificial intelligence systems become more complex and widely adopted, ensuring their alignment with human values and goals is essential to prevent unintended harm. This paper reviews four primary methodologies for evaluating alignment in Large Language Models (LLMs): human feedback, adversarial testing by domain experts, AI red teaming, and the constitutional approach to AI safety. I examine the strengths, limitations, and practical applications of each approach, highlighting critical challenges such as detecting deceptive behavior (e.g., “AI sleeper agents”) and the ethical risks of adversarial training. Additionally, this paper explores the relationship between alignment and accountability, addressing the legal and ethical questions that arise as AI systems are deployed in real-world contexts. I outline future research directions in this rapidly evolving field to support the safe and ethical development of AI systems. This study aims to provide researchers and practitioners with a structured overview of current LLM testing methodologies and insights into areas needing further exploration.
引用
收藏
页码:3233 / 3240
页数:7
相关论文
共 50 条
  • [1] Social Value Alignment in Large Language Models
    Abbol, Giulio Antonio
    Marchesi, Serena
    Wykowska, Agnieszka
    Belpaeme, Tony
    VALUE ENGINEERING IN ARTIFICIAL INTELLIGENCE, VALE 2023, 2024, 14520 : 83 - 97
  • [2] Evaluating Large Language Models for Material Selection
    Grandi, Daniele
    Jain, Yash Patawari
    Groom, Allin
    Cramer, Brandon
    Mccomb, Christopher
    JOURNAL OF COMPUTING AND INFORMATION SCIENCE IN ENGINEERING, 2025, 25 (02)
  • [3] Evaluating large language models for annotating proteins
    Vitale, Rosario
    Bugnon, Leandro A.
    Fenoy, Emilio Luis
    Milone, Diego H.
    Stegmayer, Georgina
    BRIEFINGS IN BIOINFORMATICS, 2024, 25 (03)
  • [4] Cultural bias and cultural alignment of large language models
    Tao, Yan
    Viberg, Olga
    Baker, Ryan S.
    Kizilcec, Rene F.
    PNAS NEXUS, 2024, 3 (09):
  • [5] EVALUATING LARGE LANGUAGE MODELS ON THEIR ACCURACY AND COMPLETENESS
    Edalat, Camellia
    Kirupaharan, Nila
    Dalvin, Lauren A.
    Mishra, Kapil
    Marshall, Rayna
    Xu, Hannah
    Francis, Jasmine H.
    Berkenstock, Meghan
    RETINA-THE JOURNAL OF RETINAL AND VITREOUS DISEASES, 2025, 45 (01): : 128 - 132
  • [6] Evaluating large language models for software testing
    Li, Yihao
    Liu, Pan
    Wang, Haiyang
    Chu, Jie
    Wong, W. Eric
    COMPUTER STANDARDS & INTERFACES, 2025, 93
  • [7] Evaluating Intelligence and Knowledge in Large Language Models
    Bianchini, Francesco
    TOPOI-AN INTERNATIONAL REVIEW OF PHILOSOPHY, 2025, 44 (01): : 163 - 173
  • [8] A bilingual benchmark for evaluating large language models
    Alkaoud, Mohamed
    PEERJ COMPUTER SCIENCE, 2024, 10
  • [9] Evaluating Human-Large Language Model Alignment in Group Process
    He, Yidong
    Liu, Yongbin
    Ouyang, Chunping
    Liu, Huan
    Han, Wenyong
    Gao, Yu
    Zhu, Chi
    Tang, Yi
    Zhong, Jin
    Zhou, Shuda
    Huang, Le
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT II, NLPCC 2024, 2025, 15360 : 412 - 423
  • [10] Evaluating large language models in theory of mind tasks
    Kosinski, Michal
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2024, 121 (45)