RETRACTED: New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-assessment Study Program for Urology (Retracted Article)

Cited by: 46
Authors
Huynh, Linda My [1]
Bonebrake, Benjamin T. [2]
Schultis, Kaitlyn [2]
Quach, Alan [3]
Deibert, Christopher M. [3,4]
Affiliations
[1] Univ Nebraska Med Ctr, Omaha, NE USA
[2] Univ Nebraska Med Ctr, Coll Med, Omaha, NE USA
[3] Univ Nebraska Med Ctr, Div Urol, Omaha, NE USA
[4] Univ Nebraska Med Ctr, Dept Surg, Div Urol, 987521 Nebraska Med Ctr, Omaha, NE 68198 USA
Keywords
artificial intelligence; medical informatics applications; urology
DOI
10.1097/UPJ.0000000000000406
Chinese Library Classification
R5 [Internal Medicine]; R69 [Urology (Diseases of the Urogenital System)]
Discipline classification code
1002; 100201
Abstract
Introduction: Large language models have demonstrated impressive capabilities, but their application to medicine remains unclear. We seek to evaluate the use of ChatGPT on the American Urological Association Self-assessment Study Program as an educational adjunct for urology trainees and practicing physicians.
Methods: One hundred fifty questions from the 2022 Self-assessment Study Program exam were screened, and those containing visual assets (n=15) were removed. The remaining items were encoded as open-ended or multiple-choice. ChatGPT's output was coded as correct, incorrect, or indeterminate; if indeterminate, responses were regenerated up to 2 times. Concordance, quality, and accuracy were ascertained by 3 independent researchers and reviewed by 2 physician adjudicators. A new session was started for each entry to avoid crossover learning.
Results: ChatGPT was correct on 36/135 (26.7%) open-ended and 38/135 (28.2%) multiple-choice questions. Indeterminate responses were generated in 40 (29.6%) and 4 (3.0%), respectively. Of the correct responses, 24/36 (66.7%) and 36/38 (94.7%) were on initial output, 8 (22.2%) and 1 (2.6%) on second output, and 4 (11.1%) and 1 (2.6%) on final output, respectively. Although regeneration decreased indeterminate responses, the proportion of correct responses did not increase. For open-ended and multiple-choice questions, ChatGPT provided consistent justifications for incorrect answers and remained concordant between correct and incorrect answers.
Conclusions: ChatGPT previously demonstrated promise on medical licensing exams; however, application to the 2022 Self-assessment Study Program was not demonstrated. Performance improved with multiple-choice over open-ended questions. More important were the persistent justifications for incorrect responses: left unchecked, utilization of ChatGPT in medicine may facilitate medical misinformation.
Pages: 408 / +
Number of pages: 8