Performance Evaluation of Large Language Models in Cervical Cancer Management Based on a Standardized Questionnaire: Comparative Study

Cited: 0
Authors
Kuerbanjiang, Warisijiang [1 ]
Peng, Shengzhe [1 ]
Jiamaliding, Yiershatijiang [1 ]
Yi, Yuexiong [1 ]
Affiliations
[1] Wuhan Univ, Zhongnan Hosp, Dept Gynecol, 169 Donghu Rd, Wuhan 430071, Hubei, Peoples R China
Keywords
large language model; cervical cancer; screening; artificial intelligence; model interpretability; GUIDELINES; CHATGPT;
D O I
10.2196/63626
CLC Classification
R19 [Health Organization and Services (Health Service Management)];
Subject Classification
Abstract
Background: Cervical cancer remains the fourth leading cause of cancer-related death among women globally, with a particularly severe burden in low-resource settings. A comprehensive approach, from screening through diagnosis and treatment, is essential for effective prevention and management. Large language models (LLMs) have emerged as potential tools to support health care, though their specific role in cervical cancer management remains underexplored. Objective: This study aims to systematically evaluate the performance and interpretability of LLMs in cervical cancer management. Methods: Models were selected from the AlpacaEval leaderboard version 2.0 and according to the computational resources available to us. The questions input into the models covered general knowledge, screening, diagnosis, and treatment, in accordance with clinical guidelines. The prompt was developed using the Context, Objective, Style, Tone, Audience, and Response (CO-STAR) framework. Responses were evaluated for accuracy, guideline compliance, clarity, and practicality and graded A, B, C, or D, with corresponding scores of 3, 2, 1, and 0. The effective rate was calculated as the ratio of A and B responses to the total number of designed questions. Local Interpretable Model-Agnostic Explanations (LIME) was used to explain model outputs and enhance physicians' trust in them within the medical context. Results: Nine models were included in this study, and a set of 100 standardized questions covering general information, screening, diagnosis, and treatment was designed based on international and national guidelines. Seven models (ChatGPT-4.0 Turbo, Claude 2, Gemini Pro, Mistral-7B-v0.2, Starling-LM-7B alpha, HuatuoGPT, and BioMedLM 2.7B) provided stable responses. Among all models included, ChatGPT-4.0 Turbo ranked first, with a mean score of 2.67 (95% CI 2.54-2.80; effective rate 94.00%) with a prompt and 2.52 (95% CI 2.37-2.67; effective rate 87.00%) without one, outperforming the other 8 models (P<.001).
Regardless of prompts, QiZhenGPT consistently ranked among the lowest-performing models, with P<.01 in comparisons against all models except BioMedLM. Interpretability analysis showed that prompts improved alignment with human annotations for proprietary models (median intersection over union 0.43), while medical-specialized models exhibited limited improvement. Conclusions: Proprietary LLMs, particularly ChatGPT-4.0 Turbo and Claude 2, show promise in clinical decision-making involving logical analysis. The use of prompts can enhance the accuracy of some models in cervical cancer management to varying degrees. Medical-specialized models, such as HuatuoGPT and BioMedLM, did not perform as well as expected in this study. By contrast, proprietary models, particularly those augmented with prompts, demonstrated notable accuracy and interpretability in medical tasks, such as cervical cancer management. However, this study underscores the need for further research to explore the practical application of LLMs in medical practice.
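The grading scheme described in the abstract (grades A, B, C, and D scored 3, 2, 1, and 0, with the effective rate defined as the share of A and B responses among all designed questions) can be sketched as a short calculation. The grade list below is an illustrative placeholder, not the study's data, and the normal-approximation confidence interval is an assumption; the paper's abstract does not state how its CIs were computed.

```python
# Sketch of the scoring scheme from the Methods section: grades A-D map to
# scores 3-0, and the effective rate is the share of A and B responses.
from statistics import mean, stdev

SCORE = {"A": 3, "B": 2, "C": 1, "D": 0}

def evaluate(grades):
    """Return (mean score, approximate 95% CI, effective rate) for a grade list."""
    scores = [SCORE[g] for g in grades]
    n = len(scores)
    m = mean(scores)
    # Normal-approximation 95% CI for the mean (assumed method, for illustration).
    half = 1.96 * stdev(scores) / n ** 0.5 if n > 1 else 0.0
    effective = sum(g in ("A", "B") for g in grades) / n
    return m, (m - half, m + half), effective

# Hypothetical grade distribution over 100 questions:
grades = ["A"] * 70 + ["B"] * 20 + ["C"] * 7 + ["D"] * 3
m, ci, eff = evaluate(grades)
print(f"mean={m:.2f} CI=({ci[0]:.2f}, {ci[1]:.2f}) effective rate={eff:.0%}")
```

With the placeholder grades above, 90 of 100 responses are A or B, so the effective rate is 90%.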
Pages: 19
Related Articles
50 records
  • [41] Exploring the Application of Large Language Models in Instrumentation and Control Engineering Education: A Questionnaire Survey and Examination Performance Analysis
    Xu, Wu
    Wei, Zhang
    Yan, Peng
    EUROPEAN JOURNAL OF EDUCATION, 2025, 60 (01)
  • [42] Evaluation of large language models in breast cancer clinical scenarios: a comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2
    Deng, Linfang
    Wang, Tianyi
    Yangzhang
    Zhai, Zhenhua
    Tao, Wei
    Li, Jincheng
    Zhao, Yi
    Luo, Shaoting
    Xu, Jinjiang
    INTERNATIONAL JOURNAL OF SURGERY, 2024, 110 (04) : 1941 - 1950
  • [43] Assessing the performance of large language models (LLMs) in answering medical questions regarding breast cancer in the Chinese context
    Piao, Ying
    Chen, Hongtao
    Wu, Shihai
    Li, Xianming
    Li, Zihuang
    Yang, Dong
    DIGITAL HEALTH, 2024, 10
  • [44] Artificial Intelligence in Academic Translation: A Comparative Study of Large Language Models and Google Translate
    Mohsen, Mohammed Ali
    PSYCHOLINGUISTICS, 2024, 35 (02): : 134 - 156
  • [45] A Comparative Study of Chatbot Response Generation: Traditional Approaches Versus Large Language Models
    McTear, Michael
    Marokkie, Sheen Varghese
    Bi, Yaxin
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT II, KSEM 2023, 2023, 14118 : 70 - 79
  • [46] A Comparative Study of Error Correction Patterns between Large Language Models and Native Speakers for Korean Language Learners
    Nam, Sinhye
    JOURNAL OF THE INTERNATIONAL NETWORK FOR KOREAN LANGUAGE AND CULTURE, 2024, 21 (03): : 29 - 52
  • [47] Evaluating the Effectiveness of Advanced Large Language Models in Medical Knowledge: A Comparative Study Using the Japanese National Medical Examination
    Liu, Mingxin
    Okuhara, Tsuyoshi
    Dai, Zhehao
    Huang, Wenbo
    Gu, Lin
    Okada, Hiroko
    Furukawa, Emi
    Kiuchi, Takahiro
    INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, 2025, 193
  • [48] Performance of Large Language Models in Patient Complaint Resolution: Web-Based Cross-Sectional Survey
    Yong, Lorraine Pei Xian
    Tung, Joshua Yi Min
    Lee, Zi Yao
    Kuan, Win Sen
    Chua, Mui Teng
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [49] Integrating large language models in mental health practice: a qualitative descriptive study based on expert interviews
    Ma, Yingzhuo
    Zeng, Yi
    Liu, Tong
    Sun, Ruoshan
    Xiao, Mingzhao
    Wang, Jun
    FRONTIERS IN PUBLIC HEALTH, 2024, 12
  • [50] Enhancing Code Security Through Open-Source Large Language Models: A Comparative Study
    Ridley, Norah
    Branca, Enrico
    Kimber, Jadyn
    Stakhanova, Natalia
    FOUNDATIONS AND PRACTICE OF SECURITY, PT I, FPS 2023, 2024, 14551 : 233 - 249