Autonomous medical evaluation for guideline adherence of large language models

Cited: 0
Authors
Fast, Dennis [1 ]
Adams, Lisa C. [2 ]
Busch, Felix [2 ]
Fallon, Conor [1 ]
Huppertz, Marc [3 ]
Siepmann, Robert [3 ]
Prucker, Philipp [2 ]
Bayerl, Nadine [4 ]
Truhn, Daniel [3 ]
Makowski, Marcus [2 ]
Löser, Alexander [1 ]
Bressem, Keno K. [2 ,5 ]
Affiliations
[1] DATEXIS, Berliner Hochschule für Technik (BHT), Berlin, Germany
[2] Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital, Munich, Germany
[3] Department of Radiology, University Hospital Aachen, Aachen, Germany
[4] Department of Radiology, University Hospital Erlangen, Friedrich-Alexander-University (FAU) Erlangen-Nuremberg, Erlangen, Germany
[5] Department of Cardiovascular Radiology and Nuclear Medicine, Technical University of Munich, School of Medicine and Health, German Heart Center, TUM University Hospital, Munich, Germany
DOI
10.1038/s41746-024-01356-6
Abstract
Autonomous Medical Evaluation for Guideline Adherence (AMEGA) is a comprehensive benchmark designed to evaluate large language models’ adherence to medical guidelines across 20 diagnostic scenarios spanning 13 specialties. It provides an evaluation framework and methodology for assessing models’ capabilities in medical reasoning, differential diagnosis, treatment planning, and guideline adherence, using open-ended questions that mirror real-world clinical interactions. The benchmark comprises 135 questions and 1337 weighted scoring elements designed to assess comprehensive medical knowledge. In tests of 17 LLMs, GPT-4 scored highest with 41.9/50, followed closely by Llama-3 70B and WizardLM-2-8x22B; for comparison, a recent medical graduate scored 25.8/50. The benchmark introduces novel content to avoid the issue of LLMs memorizing existing medical data. AMEGA’s publicly available code supports further research in AI-assisted clinical decision-making, aiming to enhance patient care by aiding clinicians in diagnosis and treatment under time constraints.
Related papers
50 results in total
  • [41] The TRIPOD-LLM reporting guideline for studies using large language models
    Gallifant, Jack
    Afshar, Majid
    Ameen, Saleem
    Aphinyanaphongs, Yindalon
    Chen, Shan
    Cacciamani, Giovanni
    Demner-Fushman, Dina
    Dligach, Dmitriy
    Daneshjou, Roxana
    Fernandes, Chrystinne
    Hansen, Lasse Hyldig
    Landman, Adam
    Lehmann, Lisa
    Mccoy, Liam G.
    Miller, Timothy
    Moreno, Amy
    Munch, Nikolaj
    Restrepo, David
    Savova, Guergana
    Umeton, Renato
    Gichoya, Judy Wawira
    Collins, Gary S.
    Moons, Karel G. M.
    Celi, Leo A.
    Bitterman, Danielle S.
    NATURE MEDICINE, 2025, 31 (01) : 60 - 69
  • [42] Implementation And Evaluation Of An Alcohol Withdrawal Guideline In A Large Academic Medical Center
    Richman, L. S.
    Dzierba, A.
    Muskin, P.
    Bouchard, N.
    Schek, V.
    AMERICAN JOURNAL OF RESPIRATORY AND CRITICAL CARE MEDICINE, 2014, 189
  • [43] Evaluation and mitigation of cognitive biases in medical language models
    Schmidgall, Samuel
    Harris, Carl
    Essien, Ime
    Olshvang, Daniel
    Rahman, Tawsifur
    Kim, Ji Woong
    Ziaei, Rojin
    Eshraghian, Jason
    Abadir, Peter
    Chellappa, Rama
    NPJ DIGITAL MEDICINE, 2024, 7 (01)
  • [44] Embracing Large Language Models for Medical Applications: Opportunities and Challenges
    Karabacak, Mert
    Margetis, Konstantinos
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (05)
  • [45] Large language models (ChatGPT) in medical education: Embrace or abjure?
    Luke, Nathasha
    Taneja, Reshma
    Ban, Kenneth
    Samarasekera, Dujeepa
    Yap, Celestial T.
    ASIA PACIFIC SCHOLAR, 2023, 8 (04): 50 - 52
  • [46] Leveraging foundation and large language models in medical artificial intelligence
    Wong, Io Nam
    Monteiro, Olivia
    Baptista-Hon, Daniel T.
    Wang, Kai
    Lu, Wenyang
    Sun, Zhuo
    Nie, Sheng
    Yin, Yun
    CHINESE MEDICAL JOURNAL, 2024, 137 (21) : 2529 - 2539
  • [47] Large language models for generating medical examinations: systematic review
    Artsi, Yaara
    Sorin, Vera
    Konen, Eli
    Glicksberg, Benjamin S.
    Nadkarni, Girish
    Klang, Eyal
    BMC MEDICAL EDUCATION, 2024, 24 (01)
  • [48] A systematic review of large language models and their implications in medical education
    Lucas, Harrison C.
    Upperman, Jeffrey S.
    Robinson, Jamie R.
    MEDICAL EDUCATION, 2024, 58 (11) : 1276 - 1285
  • [50] Enhancing the assessment of large language models in medical information generation
    Eleiwa, Taher K.
    Elhusseiny, Abdelrahman M.
    OPHTHALMOLOGY RETINA, 2024, 8 (05): e15 - e15