Performance Evaluation of Large Language Models in Cervical Cancer Management Based on a Standardized Questionnaire: Comparative Study

Cited by: 0
Authors
Kuerbanjiang, Warisijiang [1 ]
Peng, Shengzhe [1 ]
Jiamaliding, Yiershatijiang [1 ]
Yi, Yuexiong [1 ]
Affiliations
[1] Department of Gynecology, Zhongnan Hospital of Wuhan University, 169 Donghu Road, Wuhan 430071, Hubei, People's Republic of China
Keywords
large language model; cervical cancer; screening; artificial intelligence; model interpretability; guidelines; ChatGPT
DOI
10.2196/63626
Chinese Library Classification (CLC)
R19 [Health Organization and Services (Health Service Management)]
Abstract
Background: Cervical cancer remains the fourth leading cause of cancer death among women worldwide, with a particularly severe burden in low-resource settings. A comprehensive approach, spanning screening, diagnosis, and treatment, is essential for effective prevention and management. Large language models (LLMs) have emerged as potential tools to support health care, although their specific role in cervical cancer management remains underexplored.

Objective: This study aimed to systematically evaluate the performance and interpretability of LLMs in cervical cancer management.

Methods: Models were selected from version 2.0 of the AlpacaEval leaderboard, subject to the computing hardware available to us. The questions put to the models covered general knowledge, screening, diagnosis, and treatment, in accordance with international and national guidelines. The prompt was developed using the Context, Objective, Style, Tone, Audience, and Response (CO-STAR) framework. Responses were evaluated for accuracy, guideline compliance, clarity, and practicality and graded A, B, C, or D, scored 3, 2, 1, and 0, respectively. The effective rate was calculated as the ratio of A and B responses to the total number of designed questions. Local Interpretable Model-Agnostic Explanations (LIME) was used to explain model outputs and thereby strengthen physicians' trust in them in the medical context.

Results: Nine models were included in this study, and a set of 100 standardized questions covering general information, screening, diagnosis, and treatment was designed on the basis of international and national guidelines. Seven models (ChatGPT-4.0 Turbo, Claude 2, Gemini Pro, Mistral-7B-v0.2, Starling-LM-7B alpha, HuatuoGPT, and BioMedLM 2.7B) provided stable responses. Among all included models, ChatGPT-4.0 Turbo ranked first, with a mean score of 2.67 (95% CI 2.54-2.80; effective rate 94.00%) with a prompt and 2.52 (95% CI 2.37-2.67; effective rate 87.00%) without one, outperforming the other 8 models (P<.001). With or without prompts, QiZhenGPT consistently ranked among the lowest-performing models, with P<.01 in comparisons against all models except BioMedLM. Interpretability analysis showed that prompts improved alignment with human annotations for proprietary models (median intersection over union 0.43), whereas medical-specialized models showed limited improvement.

Conclusions: Proprietary LLMs, particularly ChatGPT-4.0 Turbo and Claude 2, show promise for clinical decision-making that involves logical analysis. Prompts can improve the accuracy of some models in cervical cancer management to varying degrees. Medical-specialized models, such as HuatuoGPT and BioMedLM, did not perform as well as expected in this study; by contrast, proprietary models, particularly when augmented with prompts, demonstrated notable accuracy and interpretability in medical tasks such as cervical cancer management. However, this study underscores the need for further research into the practical application of LLMs in medical practice.
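As a hedged illustration only: the abstract names the CO-STAR framework but does not reproduce the study's prompt, so the Python sketch below shows how a CO-STAR-structured prompt for this task might look. Every field's wording and the build_query helper are assumptions, not the authors' actual materials.

# Illustrative CO-STAR prompt skeleton (assumed wording, not the study's prompt).
CO_STAR_PROMPT = """\
# CONTEXT #
You are assisting clinicians with questions on cervical cancer management,
drawing on international and national guidelines.

# OBJECTIVE #
Answer the question accurately and consistently with current guidelines.

# STYLE #
Concise clinical prose, citing guideline recommendations where relevant.

# TONE #
Professional and neutral.

# AUDIENCE #
Gynecologists and other physicians.

# RESPONSE #
A short, structured answer: recommendation first, rationale second.
"""

def build_query(question: str) -> str:
    """Prepend the CO-STAR framing to one standardized question."""
    return f"{CO_STAR_PROMPT}\nQuestion: {question}"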
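The evaluation arithmetic in the abstract is simple enough to state as code. This minimal sketch assumes only what the abstract gives: grades A/B/C/D scored 3/2/1/0, effective rate = (number of A and B responses) / (total designed questions), and intersection over union (IoU) between model-highlighted and human-annotated tokens for the LIME analysis. The normal-approximation confidence interval and the example grade counts are illustrative assumptions; the study's exact statistical procedure is not described in the abstract.

from math import sqrt
from statistics import mean, stdev

# Rubric from the abstract: A=3, B=2, C=1, D=0.
GRADE_SCORE = {"A": 3, "B": 2, "C": 1, "D": 0}

def mean_score_ci(grades: list[str], z: float = 1.96) -> tuple[float, float, float]:
    """Mean rubric score with a normal-approximation 95% CI (assumed method)."""
    scores = [GRADE_SCORE[g] for g in grades]
    m = mean(scores)
    half = z * stdev(scores) / sqrt(len(scores))
    return m, m - half, m + half

def effective_rate(grades: list[str]) -> float:
    """Share of A and B responses among all designed questions."""
    return sum(g in ("A", "B") for g in grades) / len(grades)

def iou(model_tokens: set[str], human_tokens: set[str]) -> float:
    """IoU of LIME-highlighted tokens vs human-annotated tokens."""
    union = model_tokens | human_tokens
    return len(model_tokens & human_tokens) / len(union) if union else 0.0

# Fabricated example for illustration only (100 questions, as in the study design):
grades = ["A"] * 70 + ["B"] * 24 + ["C"] * 4 + ["D"] * 2
print(mean_score_ci(grades))   # mean score with its approximate 95% CI
print(effective_rate(grades))  # 0.94, i.e., an effective rate of 94.00%

With these fabricated grades the effective rate is 94%, matching the scale of the reported top result; the actual per-question grades are not given in the abstract.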
Pages: 19