Autonomous medical evaluation for guideline adherence of large language models

Authors
Fast, Dennis [1]
Adams, Lisa C. [2]
Busch, Felix [2]
Fallon, Conor [1]
Huppertz, Marc [3]
Siepmann, Robert [3]
Prucker, Philipp [2]
Bayerl, Nadine [4]
Truhn, Daniel [3]
Makowski, Marcus [2]
Löser, Alexander [1]
Bressem, Keno K. [2, 5]
Affiliations
[1] DATEXIS, Berliner Hochschule für Technik (BHT), Berlin, Germany
[2] Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital, Munich, Germany
[3] Department of Radiology, University Hospital Aachen, Aachen, Germany
[4] Department of Radiology, University Hospital Erlangen, Friedrich-Alexander-University (FAU) Erlangen-Nuremberg, Erlangen, Germany
[5] Department of Cardiovascular Radiology and Nuclear Medicine, Technical University of Munich, School of Medicine and Health, German Heart Center, TUM University Hospital, Munich, Germany
DOI
10.1038/s41746-024-01356-6
Abstract
Autonomous Medical Evaluation for Guideline Adherence (AMEGA) is a comprehensive benchmark designed to evaluate large language models’ adherence to medical guidelines across 20 diagnostic scenarios spanning 13 specialties. It includes an evaluation framework and methodology to assess models’ capabilities in medical reasoning, differential diagnosis, treatment planning, and guideline adherence, using open-ended questions that mirror real-world clinical interactions. It includes 135 questions and 1337 weighted scoring elements designed to assess comprehensive medical knowledge. In tests of 17 LLMs, GPT-4 scored highest with 41.9/50, followed closely by Llama-3 70B and WizardLM-2-8x22B. For comparison, a recent medical graduate scored 25.8/50. The benchmark introduces novel content to avoid the issue of LLMs memorizing existing medical data. AMEGA’s publicly available code supports further research in AI-assisted clinical decision-making, aiming to enhance patient care by aiding clinicians in diagnosis and treatment under time constraints.