Analyzing evaluation methods for large language models in the medical field: a scoping review

Cited by: 0
Authors
Lee, Junbok [1 ,2 ]
Park, Sungkyung [3 ]
Shin, Jaeyong [4 ,5 ]
Cho, Belong [2 ,6 ,7 ]
Affiliations
[1] Yonsei Univ, Inst Innovat Digital Healthcare, Seoul, South Korea
[2] Seoul Natl Univ, Coll Med, Dept Human Syst Med, Seoul, South Korea
[3] Seoul Natl Univ Sci & Technol, Dept Bigdata AI Management Informat, Seoul, South Korea
[4] Yonsei Univ, Dept Prevent Med & Publ Hlth, Coll Med, 50-1 Yonsei Ro, Seoul 03722, South Korea
[5] Yonsei Univ, Coll Med, Inst Hlth Serv Res, Seoul, South Korea
[6] Seoul Natl Univ Hosp, Dept Family Med, Seoul, South Korea
[7] Seoul Natl Univ, Coll Med, Dept Family Med, 101 Daehak Ro, Seoul 03080, South Korea
Keywords
Large language model; LLM; Evaluation methods; ChatGPT; Performance; Questions; Education; Accuracy
DOI
10.1186/s12911-024-02709-7
Chinese Library Classification (CLC) number: R-058
Abstract
Background: Owing to the rapid growth in the popularity of large language models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs.
Objective: This study reviews LLM evaluation studies in the medical field and analyzes the research methods they used, aiming to serve as a reference for future researchers designing LLM studies.
Methods & materials: We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the type of evaluation method, the number of questions (queries), the number of evaluators, the use of repeated measurements, additional analysis methods, the use of prompt engineering, and metrics other than accuracy.
Results: A total of 142 articles met the inclusion criteria. LLM evaluations fell primarily into two categories: administering test examinations to the model (n = 53, 37.3%) or having its output assessed by medical professionals (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). Most examination-style studies had 100 or fewer questions (n = 18, 29.0%); 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For assessment by medical professionals, most studies used 50 or fewer queries (n = 54, 64.3%), most had two evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering.
Conclusions: More research is required on the application of LLMs in healthcare. Although previous studies have focused on evaluating performance, future studies will likely focus on improving it. A well-structured methodology is required for such studies to be conducted systematically.
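To illustrate the examination-style design elements tallied above (a fixed question set, repeated measurements, and accuracy as the primary metric), the following is a minimal Python sketch, not taken from the reviewed paper; query_llm is a hypothetical placeholder for whichever model API a given study actually uses.

def query_llm(prompt: str) -> str:
    """Hypothetical model call; swap in a real LLM API client here."""
    raise NotImplementedError

def evaluate(questions: list[dict], repeats: int = 3) -> dict:
    """Score multiple-choice questions over several runs and report accuracy."""
    per_run_accuracy = []
    for _ in range(repeats):  # repeated measurements, as in some reviewed studies
        correct = 0
        for q in questions:
            # Take the first character of the reply as the chosen option, e.g. "A"
            answer = query_llm(q["prompt"]).strip().upper()[:1]
            correct += answer == q["answer"]
        per_run_accuracy.append(correct / len(questions))
    return {"mean_accuracy": sum(per_run_accuracy) / repeats,
            "per_run": per_run_accuracy}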
Pages: 11