Assessing unknown potential: quality and limitations of different large language models in the field of otorhinolaryngology

Cited by: 2
Authors
Buhr, Christoph R. [1 ,2 ]
Smith, Harry [3 ]
Huppertz, Tilman [1 ]
Bahr-Hamm, Katharina [1 ]
Matthias, Christoph [1 ]
Cuny, Clemens [4 ]
Snijders, Jan Phillipp [4 ]
Ernst, Benjamin Philipp [5 ]
Blaikie, Andrew [2 ]
Kelsey, Tom [3 ]
Kuhn, Sebastian [6 ]
Eckrich, Jonas [1 ]
Affiliations
[1] Johannes Gutenberg Univ Mainz, Univ Med Ctr, Dept Otorhinolaryngol, Langenbeckstr 1, D-55131 Mainz, Rhineland Palat, Germany
[2] Univ St Andrews, Sch Med, St Andrews, Scotland
[3] Univ St Andrews, Sch Comp Sci, St Andrews, Scotland
[4] Outpatient Clin, Dieburg, Germany
[5] Univ Hosp Frankfurt, Dept Otorhinolaryngol, Frankfurt, Germany
[6] Philipps Univ Marburg, Univ Hosp Giessen & Marburg, Inst Digital Med, Marburg, Germany
Keywords
Large language models; artificial intelligence; ChatGPT; Bard; Claude; otorhinolaryngology; digital health; chatbots; global health; CHALLENGES; HEALTH
DOI
10.1080/00016489.2024.2352843
CLC classification
R76 [Otorhinolaryngology]
Subject classification
100213
Abstract
Background: Large language models (LLMs) might offer a solution to the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear. Aims/objectives: Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL). Material and methods: Case-based questions were extracted from the literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert scale for medical adequacy, comprehensibility, coherence, and conciseness. The given answers were compared with validated reference answers and evaluated for potential hazards. A modified Turing test was performed, and character counts were compared. Results: The LLMs' answers were rated inferior to the consultants' in all categories. Yet the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among the LLMs, Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246) of cases, ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) of ratings for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/246) for Bard 2023.07.13, and 6% (71/1230) for the consultants. Conclusions and significance: Despite the consultants' superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on a larger scale.
Pages: 237-242
Page count: 6
Related articles (10 of 50 shown)
  • [1] Comparative Analysis of Information Quality in Pediatric Otorhinolaryngology: Clinicians, Residents, and Large Language Models
    Trecca, Eleonora M. C.
    Caponio, Vito Carlo Alberto
    Turri-Zanoni, Mario
    di Lullo, Antonella Miriam
    Gaffuri, Michele
    Lechien, Jerome R.
    Maniaci, Antonino
    Maruccio, Giuseppe
    Reale, Marella
    Visconti, Irene Claudia
    Dallari, Virginia
    OTOLARYNGOLOGY-HEAD AND NECK SURGERY, 2025,
  • [2] Assessing the Current Limitations of Large Language Models in Advancing Health Care Education
    Kim, Jaeyong
    Vajravelu, Bathri Narayan
    JMIR FORMATIVE RESEARCH, 2025, 9
  • [3] Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery
    Buhr, Christoph Raphael
    Ernst, Benjamin Philipp
    Blaikie, Andrew
    Smith, Harry
    Kelsey, Tom
    Matthias, Christoph
    Fleischmann, Maximilian
    Jungmann, Florian
    Alt, Juergen
    Brandts, Christian
    Kaemmerer, Peer W.
    Foersch, Sebastian
    Kuhn, Sebastian
    Eckrich, Jonas
    EUROPEAN ARCHIVES OF OTO-RHINO-LARYNGOLOGY, 2025, 282 (03) : 1593 - 1607
  • [4] Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz's Theory of Basic Values
    Hadar-Shoval, Dorit
    Asraf, Kfir
    Mizrachi, Yonathan
    Haber, Yuval
    Elyoseph, Zohar
    JMIR MENTAL HEALTH, 2024, 11
  • [5] Assessing the Capability of Large Language Models in Naturopathy Consultation
    Mondal, Himel
    Komarraju, Satyalakshmi
    Sathyanath, D.
    Muralidharan, Shrikanth
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2024, 16 (05)
  • [6] Assessing the potential integration of large language models in accounting practices: evidence from an emerging economy
    Toumeh, Ahmad A.
    FUTURE BUSINESS JOURNAL, 2024, 10 (01)
  • [7] Understanding natural language: Potential application of large language models to ophthalmology
    Yang, Zefeng
    Wang, Deming
    Zhou, Fengqi
    Song, Diping
    Zhang, Yinhang
    Jiang, Jiaxuan
    Kong, Kangjie
    Liu, Xiaoyi
    Qiao, Yu
    Chang, Robert T.
    Han, Ying
    Li, Fei
    Tham, Clement C.
    Zhang, Xiulan
    ASIA-PACIFIC JOURNAL OF OPHTHALMOLOGY, 2024, 13 (04):
  • [8] The potential and limitations of large language models in identification of the states of motivations for facilitating health behavior change
    Bak, Michelle
    Chin, Jessie
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2024, 31 (09) : 2047 - 2053
  • [9] Assessing large language models as assistive tools in medical consultations for Kawasaki disease
    Yan, Chunyi
    Li, Zexi
    Liang, Yongzhou
    Shao, Shuran
    Ma, Fan
    Zhang, Nanjun
    Li, Bowen
    Wang, Chuan
    Zhou, Kaiyu
    FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2025, 8
  • [10] Large language models and their big bullshit potential
    Fisher, Sarah A.
    ETHICS AND INFORMATION TECHNOLOGY, 2024, 26 (04)