Assessing unknown potential: quality and limitations of different large language models in the field of otorhinolaryngology

Cited by: 5
Authors
Buhr, Christoph R. [1 ,2 ]
Smith, Harry [3 ]
Huppertz, Tilman [1 ]
Bahr-Hamm, Katharina [1 ]
Matthias, Christoph [1 ]
Cuny, Clemens [4 ]
Snijders, Jan Phillipp [4 ]
Ernst, Benjamin Philipp [5 ]
Blaikie, Andrew [2 ]
Kelsey, Tom [3 ]
Kuhn, Sebastian [6 ]
Eckrich, Jonas [1 ]
Affiliations
[1] Johannes Gutenberg Univ Mainz, Univ Med Ctr, Dept Otorhinolaryngol, Langenbeckstr 1, D-55131 Mainz, Rhineland Palat, Germany
[2] Univ St Andrews, Sch Med, St Andrews, Scotland
[3] Univ St Andrews, Sch Comp Sci, St Andrews, Scotland
[4] Outpatient Clin, Dieburg, Germany
[5] Univ Hosp Frankfurt, Dept Otorhinolaryngol, Frankfurt, Germany
[6] Philipps Univ Marburg, Univ Hosp Giessen & Marburg, Inst Digital Med, Marburg, Germany
Keywords
Large language models; artificial intelligence; ChatGPT; Bard; Claude; otorhinolaryngology; digital health; chatbots; global health; chatbot; CHALLENGES; HEALTH;
DOI
10.1080/00016489.2024.2352843
CLC classification number
R76 [Otorhinolaryngology]
Subject classification number
100213
Abstract
Background: Large language models (LLMs) might offer a solution to the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.
Aims/objectives: Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).
Material and methods: Case-based questions were extracted from the literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert scale for medical adequacy, comprehensibility, coherence, and conciseness. The given answers were compared with validated answers and evaluated for potential hazards. A modified Turing test was performed, and character counts were compared.
Results: The LLMs' answers ranked below the consultants' in all categories. Yet the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among the LLMs, Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246) of cases, ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) of ratings for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/246) for Bard 2023.07.13, and 6% (71/1230) for the consultants.
Conclusions and significance: Despite the consultants' superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance at a larger scale.
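As a quick sanity check on the figures reported in the abstract, the agreement fractions can be verified to round to the stated percentages. This is a minimal sketch; the counts are taken directly from the abstract, and the dictionary layout is purely illustrative:

```python
# Agreement with the validated solution, as (matching answers, total answers),
# taken from the abstract's Results section.
agreement = {
    "Consultants": (228, 246),
    "ChatGPT 4": (35, 41),
    "Claude 2": (32, 41),
    "Bard 2023.07.13": (24, 41),
}

# Each fraction, rounded to the nearest whole percent, reproduces the
# percentage stated in the abstract (93%, 85%, 78%, 59%).
for name, (hits, total) in agreement.items():
    print(f"{name}: {hits}/{total} = {round(100 * hits / total)}%")
```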
Pages: 237-242
Page count: 6