Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis

Times Cited: 43
Authors
Wei, Qiuhong [1,6,7]
Yao, Zhengxiong [2 ]
Cui, Ying [3 ]
Wei, Bo [4 ]
Jin, Zhezhen [5 ]
Xu, Ximing [1 ]
Affiliations
[1] Chongqing Med Univ, Childrens Hosp, Big Data Ctr Childrens Med Care, 136 Zhongshan 2nd Rd, Chongqing 400014, Peoples R China
[2] Chongqing Med Univ, Dept Neurol, Childrens Hosp, Chongqing, Peoples R China
[3] Stanford Univ, Sch Med, Dept Biomed Data Sci, Stanford, CA USA
[4] BeiGene USA Inc, Dept Global Stat & Data Sci, San Mateo, CA USA
[5] Columbia Univ, Mailman Sch Publ Hlth, Dept Biostat, 722 West 168th St, New York, NY 10032 USA
[6] Chongqing Med Univ, Children Nutr Res Ctr, Childrens Hosp, Chongqing, Peoples R China
[7] Natl Clin Res Ctr Child Hlth & Disorders, Minist Educ, Key Lab Child Dev & Disorders, China Int Sci & Technol Cooperat Base Child Dev &, Key Lab Child Dev & Disorders, Chongqing Key Lab Ch, Chongqing, Peoples R China
Keywords
ChatGPT; Large language model; Medicine; Evaluation; Performance; Education; Quality; Tool
DOI
10.1016/j.jbi.2024.104620
CLC Number
TP39 [Computer Applications];
Subject Classification Codes
081203; 0835
Abstract
Objective: Large language models (LLMs) such as ChatGPT are increasingly explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT's performance in answering medical questions and to provide direction for future research.

Methods: An extensive literature search was conducted on June 15, 2023, across ten medical databases. The keyword used was "ChatGPT," without restrictions on publication type, language, or date. Studies evaluating ChatGPT's performance in answering medical questions were included. Exclusions comprised review articles, comments, patents, non-medical evaluations of ChatGPT, and preprint studies. Data were extracted on general study characteristics, question sources, conversation processes, assessment metrics, and the performance of ChatGPT. An evaluation framework for LLMs in medical inquiries was proposed by integrating insights from the selected literature. This study is registered with PROSPERO, CRD42023456327.

Results: A total of 3520 articles were identified, of which 60 were reviewed and summarized in this paper and 17 were included in the meta-analysis. ChatGPT displayed an overall integrated accuracy of 56% (95% CI: 51%-60%, I² = 87%) in addressing medical queries. However, the studies varied in question source, question-asking process, and evaluation metrics. As measured against our proposed evaluation framework, many studies failed to report methodological details, such as the date of inquiry, the version of ChatGPT, and inter-rater consistency.

Conclusion: This review reveals ChatGPT's potential in addressing medical inquiries, but the heterogeneity of study designs and insufficient reporting might affect the reliability of the results. Our proposed evaluation framework provides guidance for the design and transparent reporting of future studies of LLMs responding to medical questions.
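For context on the pooled estimate above, the following is a minimal sketch of DerSimonian-Laird random-effects pooling of proportions, the standard kind of calculation behind a result such as 56% (95% CI: 51%-60%, I² = 87%). The per-study counts here are hypothetical placeholders, and the review's actual model and variance transformation (e.g., a logit transform) may differ.

```python
# Minimal DerSimonian-Laird random-effects pooling of accuracy proportions.
# The study counts below are hypothetical, NOT the data from this review.
import math

# Hypothetical per-study results: (number of correct answers, total questions)
studies = [(55, 100), (120, 200), (40, 90), (70, 110)]

p = [x / n for x, n in studies]                              # per-study accuracy
v = [pi * (1 - pi) / n for pi, (_, n) in zip(p, studies)]    # binomial variances

# Fixed-effect estimate with inverse-variance weights
w = [1 / vi for vi in v]
p_fe = sum(wi * pi for wi, pi in zip(w, p)) / sum(w)

# Cochran's Q and between-study variance tau^2 (DerSimonian-Laird)
q = sum(wi * (pi - p_fe) ** 2 for wi, pi in zip(w, p))
df = len(studies) - 1
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects pooled estimate, 95% CI, and I^2 heterogeneity statistic
w_re = [1 / (vi + tau2) for vi in v]
pooled = sum(wi * pi for wi, pi in zip(w_re, p)) / sum(w_re)
se = math.sqrt(1 / sum(w_re))
lo, hi = pooled - 1.96 * se, pooled + 1.96 * se
i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

print(f"Pooled accuracy: {pooled:.1%} (95% CI {lo:.1%}-{hi:.1%}), I2 = {i2:.0f}%")
```

A high I² (such as the 87% reported in the abstract) indicates that most of the observed variation reflects genuine between-study differences rather than sampling error, which is why the review emphasizes the heterogeneity of question sources and evaluation metrics.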
Pages: 10