Assessing unknown potential: quality and limitations of different large language models in the field of otorhinolaryngology

Cited by: 2
Authors
Buhr, Christoph R. [1 ,2 ]
Smith, Harry [3 ]
Huppertz, Tilman [1 ]
Bahr-Hamm, Katharina [1 ]
Matthias, Christoph [1 ]
Cuny, Clemens [4 ]
Snijders, Jan Phillipp [4 ]
Ernst, Benjamin Philipp [5 ]
Blaikie, Andrew [2 ]
Kelsey, Tom [3 ]
Kuhn, Sebastian [6 ]
Eckrich, Jonas [1 ]
Affiliations
[1] Johannes Gutenberg Univ Mainz, Univ Med Ctr, Dept Otorhinolaryngol, Langenbeckstr 1, D-55131 Mainz, Rhineland Palat, Germany
[2] Univ St Andrews, Sch Med, St Andrews, Scotland
[3] Univ St Andrews, Sch Comp Sci, St Andrews, Scotland
[4] Outpatient Clin, Dieburg, Germany
[5] Univ Hosp Frankfurt, Dept Otorhinolaryngol, Frankfurt, Germany
[6] Philipps Univ Marburg, Univ Hosp Giessen & Marburg, Inst Digital Med, Marburg, Germany
Keywords
Large language models; artificial intelligence; ChatGPT; Bard; Claude; otorhinolaryngology; digital health; chatbots; global health; challenges; health
DOI
10.1080/00016489.2024.2352843
Chinese Library Classification number
R76 [Otorhinolaryngology];
Discipline classification code
100213;
Abstract
Background: Large language models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear. Aims/objectives: Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL). Material and methods: Case-based questions were extracted from the literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert scale for medical adequacy, comprehensibility, coherence, and conciseness. The answers given were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared. Results: LLMs' answers ranked inferior to consultants' answers in all categories. Yet the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among the LLMs, Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% of cases (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/246) for Bard 2023.07.13, and 6% (71/1230) for consultants. Conclusions and significance: Despite the consultants' superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on a larger scale.
Pages: 237-242
Page count: 6
Related papers
50 in total
  • [22] Potential Multidisciplinary Use of Large Language Models for Addressing Queries in Cardio-Oncology
    Li, Pengfei
    Zhang, Xuejuan
    Zhu, Erjia
    Yu, Shijun
    Sheng, Bin
    Tham, Yih Chung
    Wong, Tien Yin
    Ji, Hongwei
    JOURNAL OF THE AMERICAN HEART ASSOCIATION, 2024, 13 (06):
  • [23] Advancing radiology practice and research: harnessing the potential of large language models amidst imperfections
    Klang, Eyal
    Alper, Lee
    Sorin, Vera
    Barash, Yiftach
    Nadkarni, Girish N.
    Zimlichman, Eyal
    BJR OPEN, 2024, 6 (01):
  • [24] Evidence-Based Potential of Generative Artificial Intelligence Large Language Models on Dental Avulsion: ChatGPT Versus Gemini
    Kaplan, Taibe Tokgoz
    Cankar, Muhammet
    DENTAL TRAUMATOLOGY, 2025, 41 (02) : 178 - 186
  • [25] Potential of Large Language Models in Health Care: Delphi Study
    Denecke, Kerstin
    May, Richard
    Romero, Octavio Rivera
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [26] The potential of Large Language Models for social robots in special education
    Voultsiou, Evdokia
    Vrochidou, Eleni
    Moussiades, Lefteris
    Papakostas, George A.
    PROGRESS IN ARTIFICIAL INTELLIGENCE, 2025, : 165 - 189
  • [27] Efficacy of large language models and their potential in Obstetrics and Gynecology education
    Eoh, Kyung Jin
    Kwon, Gu Yeun
    Lee, Eun Jin
    Lee, Joonho
    Lee, Inha
    Kim, Young Tae
    Nam, Eun Ji
    OBSTETRICS & GYNECOLOGY SCIENCE, 2024, 67 (06) : 550 - 556
  • [28] The rise of large language models in the medical field: A bibliometric analysis
    Qi, Wenhao
    Cao, Shihua
    Wang, Bin
    Zhu, Xiaohong
    Dong, Chaoqun
    He, Danni
    Chen, Yanfei
    Shi, Yankai
    Wang, BingSheng
    PROCEEDINGS 2024 IEEE INTERNATIONAL WORKSHOP ON FOUNDATION MODELS FOR CYBER-PHYSICAL SYSTEMS & INTERNET OF THINGS, FMSYS 2024, 2024, : 56 - 62
  • [29] Exploring the role of Large Language Models in haematology: A focused review of applications, benefits and limitations
    Mudrik, Aya
    Nadkarni, Girish N.
    Efros, Orly
    Glicksberg, Benjamin S.
    Klang, Eyal
    Soffer, Shelly
    BRITISH JOURNAL OF HAEMATOLOGY, 2024, 205 (05) : 1685 - 1698
  • [30] Assessing the research landscape and clinical utility of large language models: a scoping review
    Park, Ye-Jean
    Pillai, Abhinav
    Deng, Jiawen
    Guo, Eddie
    Gupta, Mehul
    Paget, Mike
    Naugler, Christopher
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2024, 24 (01)