Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study

Times cited: 9
Authors
Iannantuono, Giovanni Maria [1]
Bracken-Clarke, Dara [2]
Karzai, Fatima [1]
Choo-Wosoba, Hyoyoung [3]
Gulley, James L. [2]
Floudas, Charalampos S. [2]
Affiliations
[1] NCI, Genitourinary Malignancies Branch, Ctr Canc Res, NIH, Bethesda, MD USA
[2] NCI, Ctr Immunooncol, Ctr Canc Res, NIH, Bethesda, MD USA
[3] NCI, Biostat & Data Management Sect, Ctr Canc Res, NIH, Bethesda, MD USA
Funding
National Institutes of Health (USA)
Keywords
large language models; artificial intelligence; immuno-oncology; ChatGPT; Google Bard
DOI
10.1093/oncolo/oyae009
Chinese Library Classification number
R73 [Oncology]
Discipline classification code
100214
Abstract
Background: The capability of large language models (LLMs) to understand and generate human-readable text has prompted investigation of their potential as educational and management tools for patients with cancer and healthcare providers. Materials and Methods: We conducted a cross-sectional study to evaluate the ability of ChatGPT-4, ChatGPT-3.5, and Google Bard to answer questions related to 4 domains of immuno-oncology (Mechanisms, Indications, Toxicities, and Prognosis). We generated 60 open-ended questions (15 for each domain). Questions were manually submitted to the LLMs, and responses were collected on June 30, 2023. Two reviewers evaluated the answers independently. Results: ChatGPT-4 and ChatGPT-3.5 answered all questions, whereas Google Bard answered only 53.3% of them (P < .0001). The proportion of questions with reproducible answers was higher for ChatGPT-4 (95%) and ChatGPT-3.5 (88.3%) than for Google Bard (50%) (P < .0001). In terms of accuracy, the proportion of answers deemed fully correct was 75.4%, 58.5%, and 43.8% for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (P = .03). Furthermore, the proportion of responses deemed highly relevant was 71.9%, 77.4%, and 43.8% for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (P = .04). Regarding readability, the proportion of highly readable responses was higher for ChatGPT-4 (98.1%) and ChatGPT-3.5 (100%) than for Google Bard (87.5%) (P = .02). Conclusion: ChatGPT-4 and ChatGPT-3.5 are potentially powerful tools in immuno-oncology, whereas Google Bard demonstrated relatively poorer performance. However, the risk of inaccuracy or incompleteness in the responses was evident in all 3 LLMs, highlighting the importance of expert-driven verification of the outputs returned by these technologies.
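The answer-rate figures in the abstract can be turned back into question counts with simple arithmetic. The sketch below is illustrative only: the helper `pct_to_count` and the variable names are hypothetical and are not taken from the study. It covers only the answer rate, whose denominator (all 60 questions) is stated in the abstract; the denominators for the reproducibility, accuracy, relevance, and readability percentages are not stated there, so they are left out.

```python
def pct_to_count(pct: float, total: int) -> int:
    """Return the whole-number count closest to `pct` percent of `total`."""
    return round(pct / 100 * total)


# Study design as described in the abstract: 4 domains, 15 questions each.
DOMAINS = ["Mechanisms", "Indications", "Toxicities", "Prognosis"]
QUESTIONS_PER_DOMAIN = 15
total_questions = QUESTIONS_PER_DOMAIN * len(DOMAINS)

# Reported share of questions each model answered (percent of all questions).
answer_rate_pct = {
    "ChatGPT-4": 100.0,
    "ChatGPT-3.5": 100.0,
    "Google Bard": 53.3,
}

# Approximate number of questions answered by each model.
answered = {
    model: pct_to_count(pct, total_questions)
    for model, pct in answer_rate_pct.items()
}

for model, count in answered.items():
    print(f"{model}: {count} of {total_questions} questions answered")
```

Under these assumptions, 53.3% of 60 corresponds to 32 questions answered by Google Bard.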
Pages: 407-414
Number of pages: 8