Autonomous medical evaluation for guideline adherence of large language models

Cited by: 0
Authors
Fast, Dennis [1]
Adams, Lisa C. [2]
Busch, Felix [2]
Fallon, Conor [1]
Huppertz, Marc [3]
Siepmann, Robert [3]
Prucker, Philipp [2]
Bayerl, Nadine [4]
Truhn, Daniel [3]
Makowski, Marcus [2]
Löser, Alexander [1]
Bressem, Keno K. [2, 5]
Affiliations
[1] DATEXIS, Berliner Hochschule für Technik (BHT), Berlin, Germany
[2] Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital, Munich, Germany
[3] Department of Radiology, University Hospital Aachen, Aachen, Germany
[4] Department of Radiology, University Hospital Erlangen, Friedrich-Alexander-University (FAU) Erlangen-Nuremberg, Erlangen, Germany
[5] Department of Cardiovascular Radiology and Nuclear Medicine, Technical University of Munich, School of Medicine and Health, German Heart Center, TUM University Hospital, Munich, Germany
DOI
10.1038/s41746-024-01356-6
Abstract
Autonomous Medical Evaluation for Guideline Adherence (AMEGA) is a comprehensive benchmark designed to evaluate large language models’ adherence to medical guidelines across 20 diagnostic scenarios spanning 13 specialties. It includes an evaluation framework and methodology to assess models’ capabilities in medical reasoning, differential diagnosis, treatment planning, and guideline adherence, using open-ended questions that mirror real-world clinical interactions. The benchmark comprises 135 questions and 1337 weighted scoring elements designed to assess comprehensive medical knowledge. In tests of 17 LLMs, GPT-4 scored highest with 41.9/50, followed closely by Llama-3 70B and WizardLM-2-8x22B. For comparison, a recent medical graduate scored 25.8/50. It introduces novel content to avoid the issue of LLMs memorizing existing medical data. AMEGA’s publicly available code supports further research in AI-assisted clinical decision-making, aiming to enhance patient care by aiding clinicians in diagnosis and treatment under time constraints.