Performance Evaluation of Large Language Models in Cervical Cancer Management Based on a Standardized Questionnaire: Comparative Study

Cited: 0
Authors
Kuerbanjiang, Warisijiang [1 ]
Peng, Shengzhe [1 ]
Jiamaliding, Yiershatijiang [1 ]
Yi, Yuexiong [1 ]
Affiliations
[1] Wuhan Univ, Zhongnan Hosp, Dept Gynecol, 169 Donghu Rd, Wuhan 430071, Hubei, Peoples R China
Keywords
large language model; cervical cancer; screening; artificial intelligence; model interpretability; GUIDELINES; CHATGPT;
DOI
10.2196/63626
CLC number
R19 [Health care organization and services (health service management)]
Abstract
Background: Cervical cancer remains the fourth leading cause of cancer death among women globally, with a particularly severe burden in low-resource settings. A comprehensive approach, from screening through diagnosis and treatment, is essential for effective prevention and management. Large language models (LLMs) have emerged as potential tools to support health care, though their specific role in cervical cancer management remains underexplored.

Objective: This study aims to systematically evaluate the performance and interpretability of LLMs in cervical cancer management.

Methods: Models were selected from the AlpacaEval leaderboard (version 2.0), subject to the constraints of our local computing hardware. The questions posed to the models covered general knowledge, screening, diagnosis, and treatment, in accordance with clinical guidelines. The prompt was developed using the Context, Objective, Style, Tone, Audience, and Response (CO-STAR) framework. Responses were evaluated for accuracy, guideline compliance, clarity, and practicality and graded A, B, C, or D, with corresponding scores of 3, 2, 1, and 0. The effective rate was calculated as the ratio of A and B responses to the total number of designed questions. Local Interpretable Model-agnostic Explanations (LIME) was used to explain model outputs and enhance physicians' trust in them within the medical context.

Results: Nine models were included in this study, and a set of 100 standardized questions covering general information, screening, diagnosis, and treatment was designed based on international and national guidelines. Seven models (ChatGPT-4.0 Turbo, Claude 2, Gemini Pro, Mistral-7B-v0.2, Starling-LM-7B alpha, HuatuoGPT, and BioMedLM 2.7B) provided stable responses. Among all included models, ChatGPT-4.0 Turbo ranked first, with a mean score of 2.67 (95% CI 2.54-2.80; effective rate 94.00%) with a prompt and 2.52 (95% CI 2.37-2.67; effective rate 87.00%) without one, outperforming the other 8 models (P<.001). Regardless of prompting, QiZhenGPT consistently ranked among the lowest-performing models (P<.01 in comparisons against all models except BioMedLM). Interpretability analysis showed that prompts improved alignment with human annotations for proprietary models (median intersection over union 0.43), whereas medical-specialized models exhibited limited improvement.

Conclusions: Proprietary LLMs, particularly ChatGPT-4.0 Turbo and Claude 2, show promise for clinical decision-making involving logical analysis, and prompting can improve the accuracy of some models in cervical cancer management to varying degrees. Medical-specialized models, such as HuatuoGPT and BioMedLM, did not perform as well as expected in this study, whereas proprietary models, particularly when augmented with prompts, demonstrated notable accuracy and interpretability in medical tasks such as cervical cancer management. Further research is needed to explore the practical application of LLMs in medical practice.
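The abstract's scoring scheme (grades A/B/C/D mapped to 3/2/1/0, effective rate as the share of A and B responses) and the LIME alignment metric (intersection over union between model-highlighted and annotator-highlighted tokens) can be sketched as follows; the function names and the example grade distribution are illustrative assumptions, not taken from the study's data:

```python
# Illustrative sketch of the grading, effective-rate, and IoU computations
# described in the abstract. Grade counts below are hypothetical.
GRADE_SCORES = {"A": 3, "B": 2, "C": 1, "D": 0}

def mean_score(grades):
    """Mean numeric score across all graded responses."""
    return sum(GRADE_SCORES[g] for g in grades) / len(grades)

def effective_rate(grades):
    """Share of responses graded A or B among all designed questions."""
    return sum(g in ("A", "B") for g in grades) / len(grades)

def iou(model_tokens, human_tokens):
    """Intersection over union between model- and human-highlighted tokens."""
    a, b = set(model_tokens), set(human_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

grades = ["A"] * 80 + ["B"] * 14 + ["C"] * 4 + ["D"] * 2  # 100 questions
print(round(mean_score(grades), 2))  # 2.72
print(effective_rate(grades))        # 0.94
print(round(iou(["hpv", "screening", "cytology"], ["hpv", "cytology"]), 2))  # 0.67
```

A confidence interval for the mean score would then follow from the per-question score distribution (e.g., via bootstrap), which is presumably how the reported 95% CIs were derived.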
Pages: 19
Related Papers (50 records)
  • [1] Performance Assessment of Large Language Models in Medical Consultation: Comparative Study
    Seo, Sujeong
    Kim, Kyuli
    Yang, Heyoung
    JMIR MEDICAL INFORMATICS, 2025, 13
  • [2] Performance of large language models on advocating the management of meningitis: a comparative qualitative study
    Fisch, Urs
    Kliem, Paulina
    Grzonka, Pascale
    Sutter, Raoul
    BMJ HEALTH & CARE INFORMATICS, 2024, 31 (01)
  • [3] Evaluation of the Performance of Three Large Language Models in Clinical Decision Support: A Comparative Study Based on Actual Cases
    Wang, Xueqi
    Ye, Haiyan
    Zhang, Sumian
    Yang, Mei
    Wang, Xuebin
    JOURNAL OF MEDICAL SYSTEMS, 2025, 49 (01)
  • [4] Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study
    Kim, Woojun
    Kim, Bong Chul
    Yeom, Han-Gyeol
    INTERNATIONAL DENTAL JOURNAL, 2025, 75 (01) : 176 - 184
  • [5] Comparative analysis of BERT-based and generative large language models for detecting suicidal ideation: a performance evaluation study
    de Oliveira, Adonias Caetano
    Bessa, Renato Freitas
    Soares, Ariel
    CADERNOS DE SAUDE PUBLICA, 2024, 40 (10):
  • [6] The Comparative Performance of Large Language Models on the Hand Surgery Self-Assessment Examination
    Chen, Clark J.
    Sobol, Keenan
    Hickey, Connor
    Raphael, James
    HAND-AMERICAN ASSOCIATION FOR HAND SURGERY, 2024,
  • [7] A Comparative Analysis of the Performance of Large Language Models and Human Respondents in Dermatology
    Murthy, Aravind Baskar
    Palaniappan, Vijayasankar
    Radhakrishnan, Suganya
    Rajaa, Sathish
    Karthikeyan, Kaliaperumal
    INDIAN DERMATOLOGY ONLINE JOURNAL, 2025, 16 (02) : 241 - 247
  • [8] Large Language Models' Responses to Spinal Cord Injury: A Comparative Study of Performance
    Li, Jinze
    Chang, Chao
    Li, Yanqiu
    Cui, Shengyu
    Yuan, Fan
    Li, Zhuojun
    Wang, Xinyu
    Li, Kang
    Feng, Yuxin
    Wang, Zuowei
    Wei, Zhijian
    Jian, Fengzeng
    JOURNAL OF MEDICAL SYSTEMS, 2025, 49 (01)
  • [9] Enhanced Artificial Intelligence in Bladder Cancer Management: A Comparative Analysis and Optimization Study of Multiple Large Language Models
    Li, Kun-peng
    Wang, Li
    Wan, Shun
    Wang, Chen-yang
    Chen, Si-yu
    Liu, Shan-hui
    Yang, Li
    JOURNAL OF ENDOUROLOGY, 2025,
  • [10] Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study
    Masanneck, Lars
    Schmidt, Linea
    Seifert, Antonia
    Koelsche, Tristan
    Huntemann, Niklas
    Jansen, Robin
    Mehsin, Mohammed
    Bernhard, Michael
    Meuth, Sven G.
    Boehm, Lennert
    Pawlitzki, Marc
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26