Performance Evaluation of Large Language Models in Cervical Cancer Management Based on a Standardized Questionnaire: Comparative Study

Cited by: 0
Authors
Kuerbanjiang, Warisijiang [1 ]
Peng, Shengzhe [1 ]
Jiamaliding, Yiershatijiang [1 ]
Yi, Yuexiong [1 ]
Affiliations
[1] Wuhan Univ, Zhongnan Hosp, Dept Gynecol, 169 Donghu Rd, Wuhan 430071, Hubei, Peoples R China
Keywords
large language model; cervical cancer; screening; artificial intelligence; model interpretability; GUIDELINES; CHATGPT;
DOI
10.2196/63626
Chinese Library Classification
R19 [Health Organizations and Services (Health Service Management)]
Abstract
Background: Cervical cancer remains the fourth leading cause of cancer-related death among women globally, with a particularly severe burden in low-resource settings. A comprehensive approach, from screening through diagnosis and treatment, is essential for effective prevention and management. Large language models (LLMs) have emerged as potential tools to support health care, although their specific role in cervical cancer management remains underexplored.

Objective: This study aimed to systematically evaluate the performance and interpretability of LLMs in cervical cancer management.

Methods: Models were selected from the AlpacaEval leaderboard version 2.0, subject to the hardware available for local deployment. The questions input to the models covered general knowledge, screening, diagnosis, and treatment, in accordance with guidelines. The prompt was developed using the Context, Objective, Style, Tone, Audience, and Response (CO-STAR) framework. Responses were evaluated for accuracy, guideline compliance, clarity, and practicality and graded A, B, C, or D, with corresponding scores of 3, 2, 1, and 0. The effective rate was calculated as the ratio of A and B responses to the total number of designed questions. Local Interpretable Model-Agnostic Explanations (LIME) was used to explain model outputs and enhance physicians' trust in them within the medical context.

Results: Nine models were included in this study, and a set of 100 standardized questions covering general information, screening, diagnosis, and treatment was designed based on international and national guidelines. Seven models (ChatGPT-4.0 Turbo, Claude 2, Gemini Pro, Mistral-7B-v0.2, Starling-LM-7B alpha, HuatuoGPT, and BioMedLM 2.7B) provided stable responses. Among all included models, ChatGPT-4.0 Turbo ranked first, with a mean score of 2.67 (95% CI 2.54-2.80; effective rate 94.00%) with a prompt and 2.52 (95% CI 2.37-2.67; effective rate 87.00%) without one, outperforming the other 8 models (P<.001).
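The scoring scheme described in the Methods can be sketched as a small computation. This is a minimal illustration only: the grade counts below are hypothetical and are not the study's data.

```python
# Sketch of the abstract's scoring scheme: grades A-D map to scores
# 3, 2, 1, 0, and the effective rate is the share of A and B responses
# among all designed questions. Grade counts here are hypothetical.
from statistics import mean

GRADE_SCORES = {"A": 3, "B": 2, "C": 1, "D": 0}

def score_responses(grades):
    """Return (mean score, effective rate) for a list of letter grades."""
    scores = [GRADE_SCORES[g] for g in grades]
    effective = sum(g in ("A", "B") for g in grades) / len(grades)
    return mean(scores), effective

# Hypothetical distribution over 100 questions:
grades = ["A"] * 70 + ["B"] * 20 + ["C"] * 7 + ["D"] * 3
mean_score, effective_rate = score_responses(grades)
```

Confidence intervals around the mean score, as reported in the abstract, would be computed separately (eg, by bootstrapping over questions); that step is not shown here.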
Regardless of prompts, QiZhenGPT consistently ranked among the lowest-performing models (P<.01 in comparisons against all models except BioMedLM). Interpretability analysis showed that prompts improved alignment with human annotations for proprietary models (median intersection over union 0.43), while medical-specialized models exhibited limited improvement.

Conclusions: Proprietary LLMs, particularly ChatGPT-4.0 Turbo and Claude 2, show promise in clinical decision-making involving logical analysis. Prompts can enhance the accuracy of some models in cervical cancer management to varying degrees. Medical-specialized models, such as HuatuoGPT and BioMedLM, did not perform as well as expected in this study. By contrast, proprietary models, particularly when augmented with prompts, demonstrated notable accuracy and interpretability in medical tasks such as cervical cancer management. However, further research is needed to explore the practical application of LLMs in medical practice.
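The intersection-over-union (IoU) metric used in the interpretability analysis can be illustrated on sets of highlighted tokens. This is a minimal sketch: the token sets are invented for illustration, and the study's actual LIME pipeline is not reproduced here.

```python
# Sketch of token-level intersection over union (IoU), the metric the
# abstract uses to compare LIME-highlighted tokens against human
# annotations. The example token sets below are invented.

def iou(model_tokens, human_tokens):
    """IoU between two collections of highlighted tokens (0.0 if both empty)."""
    a, b = set(model_tokens), set(human_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical example: LIME highlights vs a physician's annotation.
lime_highlights = {"HPV", "cytology", "colposcopy", "biopsy"}
physician_marks = {"HPV", "colposcopy", "biopsy", "CIN2"}
overlap = iou(lime_highlights, physician_marks)  # 3 shared tokens / 5 in the union
```

An IoU of 1.0 would mean the model's salient tokens exactly match the human annotation; the abstract's reported median of 0.43 for prompted proprietary models indicates partial overlap.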
Pages: 19