Performance Evaluation of Large Language Models in Cervical Cancer Management Based on a Standardized Questionnaire: Comparative Study

Cited: 0
Authors
Kuerbanjiang, Warisijiang [1 ]
Peng, Shengzhe [1 ]
Jiamaliding, Yiershatijiang [1 ]
Yi, Yuexiong [1 ]
Affiliations
[1] Wuhan Univ, Zhongnan Hosp, Dept Gynecol, 169 Donghu Rd, Wuhan 430071, Hubei, Peoples R China
Keywords
large language model; cervical cancer; screening; artificial intelligence; model interpretability; GUIDELINES; CHATGPT
DOI
10.2196/63626
CLC Classification
R19 [Health Care Organization and Services (Health Service Management)]
Abstract
Background: Cervical cancer remains the fourth leading cause of cancer death among women globally, with a particularly severe burden in low-resource settings. A comprehensive approach, from screening through diagnosis to treatment, is essential for effective prevention and management. Large language models (LLMs) have emerged as potential tools to support health care, though their specific role in cervical cancer management remains underexplored.
Objective: This study aims to systematically evaluate the performance and interpretability of LLMs in cervical cancer management.
Methods: Models were selected from the AlpacaEval leaderboard version 2.0, subject to the computing hardware available to us. The questions input to the models covered general knowledge, screening, diagnosis, and treatment, in accordance with clinical guidelines. The prompt was developed using the Context, Objective, Style, Tone, Audience, and Response (CO-STAR) framework. Responses were evaluated for accuracy, guideline compliance, clarity, and practicality and graded A, B, C, or D, with corresponding scores of 3, 2, 1, and 0. The effective rate was calculated as the ratio of A and B responses to the total number of designed questions. Local Interpretable Model-Agnostic Explanations (LIME) was used to explain model outputs and to strengthen physicians' trust in them within the medical context.
Results: Nine models were included in this study, and a set of 100 standardized questions covering general information, screening, diagnosis, and treatment was designed based on international and national guidelines. Seven models (ChatGPT-4.0 Turbo, Claude 2, Gemini Pro, Mistral-7B-v0.2, Starling-LM-7B alpha, HuatuoGPT, and BioMedLM 2.7B) provided stable responses. Among all the models included, ChatGPT-4.0 Turbo ranked first, with a mean score of 2.67 (95% CI 2.54-2.80; effective rate 94.00%) with a prompt and 2.52 (95% CI 2.37-2.67; effective rate 87.00%) without a prompt, outperforming the other 8 models (P<.001). Regardless of prompts, QiZhenGPT consistently ranked among the lowest-performing models, with P<.01 in comparisons against all models except BioMedLM. Interpretability analysis showed that prompts improved alignment with human annotations for proprietary models (median intersection over union 0.43), whereas medical-specialized models exhibited limited improvement.
Conclusions: Proprietary LLMs, particularly ChatGPT-4.0 Turbo and Claude 2, show promise for clinical decision-making that involves logical analysis. Prompts can enhance the accuracy of some models in cervical cancer management to varying degrees. Medical-specialized models, such as HuatuoGPT and BioMedLM, did not perform as well as expected in this study. By contrast, proprietary models, particularly when augmented with prompts, demonstrated notable accuracy and interpretability in medical tasks such as cervical cancer management. However, this study underscores the need for further research into the practical application of LLMs in medical practice.
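To make the scoring and interpretability metrics in the Methods concrete, the sketch below is a minimal illustration, not the authors' code: all function names, the grade distribution, and the example prompt wording are assumptions. It maps the A/B/C/D grades to scores, computes the effective rate and a normal-approximation 95% CI for the mean score, builds a CO-STAR-structured prompt, and measures intersection over union (IoU) between LIME-highlighted tokens and human annotations.

```python
# Minimal sketch (illustrative, not the study's code) of the metrics
# described in the Methods: grade scoring, effective rate, 95% CI,
# a CO-STAR-structured prompt, and IoU against human annotations.
import statistics

GRADE_SCORES = {"A": 3, "B": 2, "C": 1, "D": 0}

# Hypothetical CO-STAR prompt (Context, Objective, Style, Tone, Audience,
# Response); the wording is an assumption, not the study's actual prompt.
COSTAR_PROMPT = """\
Context: You are supporting clinicians managing cervical cancer per current guidelines.
Objective: Answer the question accurately and note the relevant guideline.
Style: Concise clinical language.
Tone: Professional and neutral.
Audience: Gynecologists and primary care physicians.
Response: A short paragraph with an actionable recommendation.
"""

def effective_rate(grades: list[str]) -> float:
    """Ratio of A and B responses to the total number of designed questions."""
    return sum(g in ("A", "B") for g in grades) / len(grades)

def mean_score_ci(grades: list[str], z: float = 1.96) -> tuple[float, float, float]:
    """Mean grade score with a normal-approximation 95% CI."""
    scores = [GRADE_SCORES[g] for g in grades]
    mean = statistics.fmean(scores)
    half_width = z * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, mean - half_width, mean + half_width

def iou(model_tokens: set[str], human_tokens: set[str]) -> float:
    """Intersection over union between LIME-highlighted tokens and human annotations."""
    union = model_tokens | human_tokens
    return len(model_tokens & human_tokens) / len(union) if union else 0.0

# Hypothetical grade distribution for 100 questions (94 graded A or B,
# matching a 94.00% effective rate; not the study's actual data).
grades = ["A"] * 72 + ["B"] * 22 + ["C"] * 5 + ["D"] * 1
print(f"effective rate: {effective_rate(grades):.2%}")
print("mean score (95% CI): {:.2f} ({:.2f}-{:.2f})".format(*mean_score_ci(grades)))
print(f"IoU: {iou({'hpv', 'screening', 'cytology'}, {'hpv', 'cytology'}):.2f}")
```

Under these assumptions, the effective rate is simply the count of A and B grades over the 100 designed questions, and the IoU example evaluates to 2/3 ≈ 0.67.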
Pages: 19
Related Articles
50 in total
  • [31] Performance of Large Language Models ChatGPT and Gemini on Workplace Management Questions in Radiology
    Leutz-Schmidt, Patricia
    Palm, Viktoria
    Mathy, Rene Michael
    Groezinger, Martin
    Kauczor, Hans-Ulrich
    Jang, Hyungseok
    Sedaghat, Sam
    DIAGNOSTICS, 2025, 15 (04)
  • [32] Evaluating performance of large language models for atrial fibrillation management using different prompting strategies and languages
    Li, Zexi
    Yan, Chunyi
    Cao, Ying
    Gong, Aobo
    Li, Fanghui
    Zeng, Rui
    SCIENTIFIC REPORTS, 2025, 15 (1)
  • [33] Examining How the Large Language Models Impact the Conceptual Design with Human Designers: A Comparative Case Study
    Zhou, Zhibin
    Li, Jinxin
    Zhang, Zhijie
    Yu, Junnan
    Duh, Henry
    INTERNATIONAL JOURNAL OF HUMAN-COMPUTER INTERACTION, 2024
  • [34] Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study
    Lee, Christine
    Mohebbi, Matthew
    O'Callaghan, Erin
    Winsberg, Mirene
    JMIR MENTAL HEALTH, 2024, 11
  • [35] Evaluation of large language models performance against humans for summarizing MRI knee radiology reports: A feasibility study
    Lopez-Ubeda, Pilar
    Martin-Noguerol, Teodoro
    Diaz-Angulo, Carolina
    Luna, Antonio
    INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, 2024, 187
  • [36] Large language models and questions from older adults: a human and machine-based evaluation study
    Dawadi, Research
    Vu, Thien
    Tay, Jie Ting
    Hoang, Phap Tran Ngoc
    Oya, Ai
    Yamamoto, Masaki
    Watanabe, Naoki
    Kuriya, Yuki
    Araki, Michihiro
    DISCOVER ARTIFICIAL INTELLIGENCE, 2025, 5 (1)
  • [37] Large Language Models Take on Cardiothoracic Surgery: A Comparative Analysis of the Performance of Four Models on American Board of Thoracic Surgery Exam Questions in 2023
    Khalpey, Zain
    Kumar, Ujjawal
    King, Nicholas
    Abraham, Alyssa
    Khalpey, Amina H.
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2024, 16 (07)
  • [38] An AI Dietitian for Type 2 Diabetes Mellitus Management Based on Large Language and Image Recognition Models: Preclinical Concept Validation Study
    Sun, Haonan
    Zhang, Kai
    Lan, Wei
    Gu, Qiufeng
    Jiang, Guangxiang
    Yang, Xue
    Qin, Wanli
    Han, Dongran
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2023, 25
  • [39] Large language models in pathology: A comparative study of ChatGPT and Bard with pathology trainees on multiple-choice questions
    Du, Wei
    Jin, Xueting
    Harris, Jaryse Carol
    Brunetti, Alessandro
    Johnson, Erika
    Leung, Olivia
    Li, Xingchen
    Walle, Selemon
    Yu, Qing
    Zhou, Xiao
    Bian, Fang
    Mckenzie, Kajanna
    Kanathanavanich, Manita
    Ozcelik, Yusuf
    El-Sharkawy, Farah
    Koga, Shunsuke
    ANNALS OF DIAGNOSTIC PATHOLOGY, 2024, 73
  • [40] Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination
    Xu, Andrew Y.
    Singh, Manjot
    Balmaceno-Criss, Mariah
    Oh, Allison
    Leigh, David
    Daher, Mohammad
    Alsoof, Daniel
    Mcdonald, Christopher L.
    Diebo, Bassel G.
    Daniels, Alan H.
    JOURNAL OF ORTHOPAEDIC SURGERY, 2025, 33 (01)