Autonomous medical evaluation for guideline adherence of large language models

Authors
Fast, Dennis [1]
Adams, Lisa C. [2]
Busch, Felix [2]
Fallon, Conor [1]
Huppertz, Marc [3]
Siepmann, Robert [3]
Prucker, Philipp [2]
Bayerl, Nadine [4]
Truhn, Daniel [3]
Makowski, Marcus [2]
Löser, Alexander [1]
Bressem, Keno K. [2,5]
Affiliations
[1] DATEXIS, Berliner Hochschule für Technik (BHT), Berlin, Germany
[2] Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital, Munich, Germany
[3] Department of Radiology, University Hospital Aachen, Aachen, Germany
[4] Department of Radiology, University Hospital Erlangen, Friedrich-Alexander-University (FAU) Erlangen-Nuremberg, Erlangen, Germany
[5] Department of Cardiovascular Radiology and Nuclear Medicine, Technical University of Munich, School of Medicine and Health, German Heart Center, TUM University Hospital, Munich, Germany
DOI
10.1038/s41746-024-01356-6
Abstract
Autonomous Medical Evaluation for Guideline Adherence (AMEGA) is a comprehensive benchmark designed to evaluate large language models’ adherence to medical guidelines across 20 diagnostic scenarios spanning 13 specialties. It includes an evaluation framework and methodology to assess models’ capabilities in medical reasoning, differential diagnosis, treatment planning, and guideline adherence, using open-ended questions that mirror real-world clinical interactions. It includes 135 questions and 1337 weighted scoring elements designed to assess comprehensive medical knowledge. In tests of 17 LLMs, GPT-4 scored highest with 41.9/50, followed closely by Llama-3 70B and WizardLM-2-8x22B. For comparison, a recent medical graduate scored 25.8/50. The benchmark introduces novel content to avoid the issue of LLMs memorizing existing medical data. AMEGA’s publicly available code supports further research in AI-assisted clinical decision-making, aiming to enhance patient care by aiding clinicians in diagnosis and treatment under time constraints.