Evaluating a large language model's ability to answer clinicians' requests for evidence summaries

Times Cited: 0
Authors
Blasingame, Mallory N. [1 ]
Koonce, Taneya Y. [1 ]
Williams, Annette M. [2 ]
Giuse, Dario A. [3 ,4 ]
Su, Jing [1 ]
Krump, Poppy A. [1 ]
Giuse, Nunzia Bettinsoli [1 ,5 ,6 ,7 ]
Affiliations
[1] Vanderbilt Univ, Med Ctr, Ctr Knowledge Management, Nashville, TN 37235 USA
[2] Vanderbilt Univ, Ctr Knowledge Management, Med Ctr, Metadata Management, Nashville, TN USA
[3] Vanderbilt Univ, Sch Med, Dept Biomed Informat, Nashville, TN USA
[4] Vanderbilt Univ, Med Ctr, Nashville, TN USA
[5] Vanderbilt Univ, Ctr Knowledge Management, Med Ctr, Knowledge Management, Nashville, TN USA
[6] Vanderbilt Univ, Med Ctr, Biomed Informat, Nashville, TN USA
[7] Vanderbilt Univ, Med Ctr, Med, Nashville, TN USA
Keywords
Large Language Models; LLMs; Generative AI; Artificial Intelligence; Evidence Synthesis; Library Information Science; Biomedical Informatics; Information Needs; ChatGPT; Professionals; Care
DOI
10.5195/jmla.2025.1985
Chinese Library Classification (CLC)
G25 [Library Science and Librarianship]; G35 [Information Science and Information Services]
Discipline Classification Code(s)
1205; 120501
Abstract
Objective: This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions, in comparison with medical librarians' gold-standard evidence syntheses.
Methods: Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question to aiChat, an internally managed chat tool using GPT-4, and recorded the responses. The summaries generated by aiChat were evaluated on whether they contained the critical elements of the librarians' established gold-standard summaries. A subset of questions was randomly selected for verification of the references provided by aiChat.
Results: Of the 216 evaluated questions, aiChat's response was assessed as "correct" for 180 (83.3%), "partially correct" for 35 (16.2%), and "incorrect" for 1 (0.5%). No significant differences were observed in ratings across question categories (p=0.73). For a subset of 30% (n=66) of the questions, 162 references were provided in the aiChat summaries, of which 60 (37%) were confirmed as nonfabricated.
Conclusions: Overall, the performance of the generative AI tool was promising. However, many of the included references could not be independently verified, and no attempt was made to assess whether additional concepts introduced by aiChat were factually accurate. We therefore envision this as the first in a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflows.
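To make the submit-and-record step concrete, the sketch below shows how a COSTAR-structured (Context, Objective, Style, Tone, Audience, Response) prompt could be sent to a GPT-4 chat endpoint and the summary captured for later rating. This is a minimal illustration, not the study's implementation: aiChat is Vanderbilt's internally managed tool, so an OpenAI-compatible API stands in for it here, and the prompt wording, the submit_question helper, and the example question are all hypothetical.

# Illustrative sketch only: the study used aiChat, an internally managed
# GPT-4 chat tool; an OpenAI-compatible endpoint stands in here, and the
# COSTAR prompt wording below is hypothetical, not the study's actual prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# COSTAR framework sections: Context, Objective, Style, Tone, Audience, Response.
COSTAR_PROMPT = """\
Context: You support medical librarians who answer clinicians' requests for evidence summaries.
Objective: Summarize the best available published evidence for the question asked.
Style: A structured evidence summary that cites its sources.
Tone: Professional and objective.
Audience: Practicing clinicians.
Response: A concise summary followed by a numbered reference list.
"""

def submit_question(question: str) -> str:
    """Submit one clinical question with the standardized prompt and return the summary."""
    response = client.chat.completions.create(
        model="gpt-4",  # stand-in for the GPT-4 model behind aiChat
        messages=[
            {"role": "system", "content": COSTAR_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Each recorded response would then be rated against the librarian's
    # gold-standard summary as correct / partially correct / incorrect,
    # with cited references spot-checked for fabrication.
    print(submit_question("Does early mobilization reduce ICU length of stay?"))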
Pages: 65-77
Page count: 13