The performance of artificial intelligence chatbot large language models to address skeletal biology and bone health queries

Cited by: 8
Authors
Cung, Michelle [1 ]
Sosa, Branden [1 ]
Yang, He S. [1 ]
McDonald, Michelle M. [2 ,3 ,4 ]
Matthews, Brya G. [5 ,6 ]
Vlug, Annegreet G. [7 ]
Imel, Erik A. [8 ]
Wein, Marc N. [9 ]
Stein, Emily Margaret [10 ,11 ,12 ]
Greenblatt, Matthew B. [1 ,12 ]
Affiliations
[1] Weill Cornell Med Coll, Dept Pathol & Lab Med, 1300 York Ave, New York, NY 10065 USA
[2] Garvan Inst Med Res, Skeletal Dis Program, Darlinghurst, Australia
[3] Univ New South Wales, St Vincents Clin Campus Sch Clin Med, Kensington 2052, Australia
[4] Univ Sydney, Sch Med Sci, Sydney 2006, Australia
[5] Univ Auckland, Dept Mol Med & Pathol, Auckland 1142, New Zealand
[6] UConn Hlth, Sch Dent Med, Ctr Regenerat Med & Skeletal Dev, Farmington, CT 06030 USA
[7] Leiden Univ Med Ctr, Ctr Bone Qual, NL-2300 Leiden, Netherlands
[8] Indiana Univ Sch Med, Indiana Ctr Musculoskeletal Hlth, Indianapolis, IN 46202 USA
[9] Massachusetts Gen Hosp, Endocrine Unit, Boston, MA 02114 USA
[10] Hosp Special Surg, Div Endocrinol, New York, NY 10021 USA
[11] Hosp Special Surg, Metab Bone Serv, New York, NY 10021 USA
[12] Hosp Special Surg, Res Div, New York, NY 10021 USA
Keywords
artificial intelligence; large language models; ChatGPT; BingAI; Bard; skeletal biology
DOI
10.1093/jbmr/zjad007
Chinese Library Classification
R5 [Internal Medicine]
Subject Classification Code
1002; 100201
Abstract
Artificial intelligence (AI) chatbots utilizing large language models (LLMs) have recently garnered significant interest due to their ability to generate humanlike responses to user inquiries in an interactive dialog format. While patients, scientific and medical providers, and trainees increasingly use these models to obtain medical information and address biomedical questions, their performance may vary from field to field. The opportunities and risks these chatbots pose to the widespread understanding of skeletal health and science are unknown. Here we assess the accuracy and quality of responses from 3 high-profile LLM chatbots, Chat Generative Pre-Trained Transformer (ChatGPT) 4.0, BingAI, and Bard, to questions in 3 categories: basic and translational skeletal biology, clinical practitioner management of skeletal disorders, and patient queries. Thirty questions were posed in each category, and responses were independently graded for their degree of accuracy by 4 reviewers. While each chatbot was often able to provide relevant information about skeletal disorders, the quality and relevance of responses varied widely, and ChatGPT 4.0 had the highest overall median score in each category. Each chatbot displayed distinct limitations, including inconsistent, incomplete, or irrelevant responses; inappropriate use of lay sources in a professional context; failure to take patient demographics or clinical context into account when making recommendations; and an inability to consistently identify areas of uncertainty in the relevant literature. Careful consideration of both the opportunities and risks of current AI chatbots is needed to formulate guidelines for best practices for their use as a source of information on skeletal health and biology.

Lay Summary
Artificial intelligence chatbots are increasingly used as a source of information in health care and research settings due to their accessibility and ability to summarize complex topics in conversational language. However, it remains unclear whether they can provide accurate information on the medicine and biology of the skeleton. Here, we tested the performance of three prominent chatbots, ChatGPT, Bard, and BingAI, by tasking them with a series of prompts based on well-established skeletal biology concepts, realistic physician-patient scenarios, and potential patient questions. Despite their similarities in function, the three chatbot services differed in the accuracy of their responses. Chatbots performed well in some contexts, but substantial limitations were observed in others, including inconsistent consideration of clinical context and patient demographics, occasional provision of incorrect or out-of-date information, and citation of inappropriate sources. With careful consideration of their current weaknesses, artificial intelligence chatbots offer the potential to transform education on skeletal health and science.
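To make the grading scheme concrete, the Python sketch below shows how median accuracy scores of the kind reported above could be tallied from independent reviewer grades. The chatbot list, the 1-4 grading scale, and the median_score helper are illustrative assumptions, not the authors' actual pipeline; the abstract does not specify the scale used, and responses were presumably collected from the chatbot interfaces rather than via code.

from statistics import median

# Illustrative assumptions (not from the paper): the 3 chatbots and 3 question
# categories named in the abstract, and a hypothetical 1-4 accuracy scale.
CHATBOTS = ["ChatGPT 4.0", "BingAI", "Bard"]
CATEGORIES = [
    "basic and translational skeletal biology",
    "clinical practitioner management of skeletal disorders",
    "patient queries",
]

def median_score(grades):
    """Pool the grades the 4 reviewers assigned to the responses in one
    (chatbot, category) cell and return the median, as reported per category."""
    return median(grades)

# Worked example: one response graded by 4 independent reviewers.
reviewer_grades = [4, 3, 4, 4]        # hypothetical grades on a 1-4 scale
print(median_score(reviewer_grades))  # -> 4.0

Reporting a median rather than a mean limits the influence of a single outlier grade, which matters when each response receives only 4 grades.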
Pages: 106-115
Page count: 10