A Brief Review on Benchmarking for Large Language Models Evaluation in Healthcare

Cited by: 1

Authors:

Budler, Leona Cilar ^{[1]}

Chen, Hongyu ^{[2]}

Chen, Aokun ^{[2]}

Topaz, Maxim ^{[3,4]}

Tam, Wilson ^{[5]}

Bian, Jiang ^{[6,7]}

Stiglic, Gregor ^{[1,8,9]}

Affiliations:

[1] Univ Maribor, Fac Hlth Sci, Maribor, Slovenia

[2] Univ Florida, Coll Med, Dept Hlth Outcomes & Biomed Informat, Gainesville, FL USA

[3] Columbia Univ, Sch Nursing, New York, NY USA

[4] Columbia Univ, Data Sci Inst, New York, NY USA

[5] Natl Univ Singapore, Alice Lee Ctr Nursing Studies, Level 5,MD6, Singapore, Singapore

[6] Regenstrief Inst Hlth Care, Indianapolis, IN USA

[7] Indiana Univ, Sch Med, Dept Biostat & Hlth Data Sci, Indianapolis, IN USA

[8] Univ Maribor, Fac Elect Engn & Comp Sci, Maribor, Slovenia

[9] Univ Edinburgh, Usher Inst, Edinburgh, Scotland

Source:

WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY | 2025 / Vol. 15 / Issue 02

Keywords:

artificial intelligence; benchmarking; chatbots; healthcare; large language models; natural language processing; OF-THE-ART; LONGITUDINAL CLINICAL NARRATIVES; DE-IDENTIFICATION; INFORMATION; RESOURCE; CORPUS;

DOI:

10.1002/widm.70010

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper reviews benchmarking methods for evaluating large language models (LLMs) in healthcare settings. It highlights the importance of rigorous benchmarking to ensure LLMs' safety, accuracy, and effectiveness in clinical applications. The review also discusses the challenges of developing standardized benchmarks and metrics tailored to healthcare-specific tasks such as medical text generation, disease diagnosis, and patient management. Ethical considerations, including privacy, data security, and bias, are also addressed, underscoring the need for multidisciplinary collaboration to establish robust benchmarking frameworks that facilitate LLMs' reliable and ethical use in healthcare. Evaluation of LLMs remains challenging due to the lack of standardized healthcare-specific benchmarks and comprehensive datasets. Key concerns include patient safety, data privacy, model bias, and limited explainability, all of which impact the overall trustworthiness of LLMs in clinical settings.

Pages: 16