Analyzing evaluation methods for large language models in the medical field: a scoping review

Cited by: 0
Authors
Lee, Junbok [1 ,2 ]
Park, Sungkyung [3 ]
Shin, Jaeyong [4 ,5 ]
Cho, Belong [2 ,6 ,7 ]
Affiliations
[1] Yonsei Univ, Inst Innovat Digital Healthcare, Seoul, South Korea
[2] Seoul Natl Univ, Coll Med, Dept Human Syst Med, Seoul, South Korea
[3] Seoul Natl Univ Sci & Technol, Dept Bigdata AI Management Informat, Seoul, South Korea
[4] Yonsei Univ, Dept Prevent Med & Publ Hlth, Coll Med, 50-1 Yonsei Ro, Seoul 03722, South Korea
[5] Yonsei Univ, Coll Med, Inst Hlth Serv Res, Seoul, South Korea
[6] Seoul Natl Univ Hosp, Dept Family Med, Seoul, South Korea
[7] Seoul Natl Univ, Coll Med, Dept Family Med, 101 Daehak Ro, Seoul 03080, South Korea
Keywords
Large language model; LLM; Evaluation methods; ChatGPT; Performance; Questions; Education; Accuracy
DOI
10.1186/s12911-024-02709-7
Chinese Library Classification (CLC) number: R-058
Abstract
Background: Owing to the rapid growth in the popularity of large language models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs.
Objective: This study reviews LLM evaluation studies in the medical field and analyzes the research methods they used, aiming to serve as a reference for future researchers designing LLM studies.
Methods & materials: We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the type of evaluation method, the number of questions (queries), the number of evaluators, the use of repeated measurements, additional analysis methods, the use of prompt engineering, and metrics other than accuracy.
Results: A total of 142 articles met the inclusion criteria. LLM evaluations fell primarily into two categories: administering test examinations to the model (n = 53, 37.3%) or having its output assessed by medical professionals (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). Most examination-style studies had 100 or fewer questions (n = 18, 29.0%); 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For assessment by medical professionals, most studies used 50 or fewer queries (n = 54, 64.3%), most had two evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering.
Conclusions: More research is required on the application of LLMs in healthcare. Although previous studies have focused on evaluating performance, future studies will likely focus on improving it. A well-structured methodology is required for such studies to be conducted systematically.
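To illustrate the examination-style design elements tallied above (a fixed question set, repeated measurements, and accuracy as the primary metric), the following is a minimal Python sketch, not taken from the reviewed paper; query_llm is a hypothetical placeholder for whichever model API a given study actually uses.

def query_llm(prompt: str) -> str:
    """Hypothetical model call; swap in a real LLM API client here."""
    raise NotImplementedError

def evaluate(questions: list[dict], repeats: int = 3) -> dict:
    """Score multiple-choice questions over several runs and report accuracy."""
    per_run_accuracy = []
    for _ in range(repeats):  # repeated measurements, as in some reviewed studies
        correct = 0
        for q in questions:
            # Take the first character of the reply as the chosen option, e.g. "A"
            answer = query_llm(q["prompt"]).strip().upper()[:1]
            correct += answer == q["answer"]
        per_run_accuracy.append(correct / len(questions))
    return {"mean_accuracy": sum(per_run_accuracy) / repeats,
            "per_run": per_run_accuracy}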
Pages: 11