Large language models encode clinical knowledge

Cited by: 779
Authors
Singhal, Karan [1 ]
Azizi, Shekoofeh [1 ]
Tu, Tao [1 ]
Mahdavi, S. Sara [1 ]
Wei, Jason [1 ]
Chung, Hyung Won [1 ]
Scales, Nathan [1 ]
Tanwani, Ajay [1 ]
Cole-Lewis, Heather [1 ]
Pfohl, Stephen [1 ]
Payne, Perry [1 ]
Seneviratne, Martin [1 ]
Gamble, Paul [1 ]
Kelly, Chris [1 ]
Babiker, Abubakr [1 ]
Schaerli, Nathanael [1 ]
Chowdhery, Aakanksha [1 ]
Mansfield, Philip [1 ]
Demner-Fushman, Dina [2 ]
Arcas, Blaise Aguera y [1 ]
Webster, Dale [1 ]
Corrado, Greg S. [1 ]
Matias, Yossi [1 ]
Chou, Katherine [1 ]
Gottweis, Juraj [1 ]
Tomasev, Nenad [3 ]
Liu, Yun [1 ]
Rajkomar, Alvin [1 ]
Barral, Joelle [1 ]
Semturs, Christopher [1 ]
Karthikesalingam, Alan [1 ]
Natarajan, Vivek [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Natl Lib Med, Bethesda, MD USA
[3] DeepMind, London, England
Keywords
HARM;
DOI
10.1038/s41586-023-06291-2
Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biosciences]; N [General natural sciences]
Discipline Classification Codes
07; 0710; 09
Abstract
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model(1) (PaLM, a 540-billion-parameter LLM) and its instruction-tuned variant, Flan-PaLM(2), on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA(3), MedMCQA(4), PubMedQA(5) and Measuring Massive Multitask Language Understanding (MMLU) clinical topics(6)), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
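The multiple-choice portion of MultiMedQA is scored by exact-match accuracy on the model's chosen option letter. As a rough illustration of that evaluation loop, here is a minimal sketch; `toy_model`, `format_prompt` and the two sample items are invented for illustration, and a real harness would send the formatted prompt to PaLM or Flan-PaLM rather than a stub.

```python
# Minimal sketch of scoring a model on MultiMedQA-style multiple-choice
# questions. `toy_model` is a hypothetical stand-in for an LLM call.

def format_prompt(question, options):
    """Render a question and lettered options as a single prompt string."""
    letters = "ABCD"
    lines = [f"Question: {question}"]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)

def accuracy(items, model):
    """Fraction of items where the model's letter matches the gold answer."""
    correct = sum(model(format_prompt(q, opts)) == gold
                  for q, opts, gold in items)
    return correct / len(items)

# Tiny illustrative dataset (MedQA-style shape, invented content).
ITEMS = [
    ("Which vitamin deficiency causes scurvy?",
     ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"], "C"),
    ("Which organ produces insulin?",
     ["Liver", "Pancreas", "Kidney", "Spleen"], "B"),
]

def toy_model(prompt):
    # Hypothetical model that always answers "C".
    return "C"

print(accuracy(ITEMS, toy_model))  # 0.5
```

The reported 67.6% MedQA figure is this same exact-match metric computed over the full test set with the paper's prompting strategies.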
Pages: 172 / +
Number of pages: 28