Response accuracy of GPT-4 across languages: insights from an expert-level diagnostic radiology examination in Japan

Cited by: 1
Authors
Harigai, Ayaka [1, 2]
Toyama, Yoshitaka [1]
Nagano, Mitsutoshi [3]
Abe, Mirei [1, 2]
Kawabata, Masahiro [2]
Li, Li [4]
Yamamura, Jin [5]
Takase, Kei [2]
Affiliations
[1] Tohoku Univ Hosp, Dept Diagnost Radiol, 1-1 Seiryo Machi,Aoba Ku, Sendai 9808575, Japan
[2] Tohoku Univ, Dept Diagnost Radiol, Grad Sch Med, 1-1 Seiryo Machi,Aoba Ku, Sendai, Miyagi, Japan
[3] Tohoku Univ Hosp, Grad Med Educ Ctr, 1-1 Seiryo Machi,Aoba Ku, Sendai, Miyagi, Japan
[4] Tohoku Med & Pharmaceut Univ, Div Radiol, 1-15-1 Fukumuro,Miyagino Ku, Sendai, Miyagi, Japan
[5] Univ Med Ctr Hamburg Eppendorf, Ctr Radiol & Endoscopy, Dept Diagnost & Intervent Radiol & Nucl Med, Martinistr 52, D-20246 Hamburg, Germany
Keywords
GPT-4; Prompt; Radiology board examination; Linguistic variation; Translation quality
DOI
10.1007/s11604-024-01673-6
Chinese Library Classification
R8 [Special medicine]; R445 [Diagnostic imaging]
Discipline classification code
1002; 100207; 1009
Abstract
Purpose
This study aims to investigate the effects of language selection and translation quality on Generative Pre-trained Transformer-4 (GPT-4)'s response accuracy to expert-level diagnostic radiology questions.

Materials and methods
We analyzed 146 diagnostic radiology questions from the Japan Radiology Board Examination (2020-2022), with consensus answers provided by two board-certified radiologists. The questions, originally in Japanese, were translated into English by GPT-4 and DeepL and into German and Chinese by GPT-4. Responses were generated by GPT-4 five times per question set per language. Response accuracy was compared between languages using one-way ANOVA with Bonferroni correction or the Mann-Whitney U test. Scores on selected English questions translated by a professional service and GPT-4 were also compared. The impact of translation quality on GPT-4's performance was assessed by linear regression analysis.

Results
The median scores (interquartile range) for the 146 questions were 70 (68-72) (Japanese), 89 (84.5-95.5) (GPT-4 English), 64 (55.5-67) (Chinese), and 56 (46.5-67.5) (German). Significant differences were found between Japanese and English (p = 0.002) and between Japanese and German (p = 0.022). The counts of correct responses across five attempts for each question were significantly associated with the quality of translation into English (GPT-4, DeepL) and German (GPT-4). In a subset of 31 questions where English translations yielded fewer correct responses than Japanese originals, professionally translated questions yielded better scores than those translated by GPT-4 (13 versus 8 points, p = 0.0079).

Conclusion
GPT-4 exhibits higher accuracy when responding to English-translated questions compared to original Japanese questions, a trend not observed with German or Chinese translations. Accuracy improves with higher-quality English translations, underscoring the importance of high-quality translations in improving GPT-4's response accuracy to diagnostic radiology questions in non-English languages and aiding non-native English speakers in obtaining accurate answers from large language models.
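
As a rough illustration of how summary statistics like those in the abstract are formed, the Python sketch below computes a median with interquartile range over five attempts and applies a Bonferroni adjustment to a set of p-values. The function names, the quartile method (the `statistics.quantiles` default), and every number are assumptions made for illustration; they are not the study's data and not necessarily its exact procedure.

```python
from statistics import median, quantiles


def median_iqr(scores):
    """Return (median, Q1, Q3) for a list of per-attempt total scores."""
    # quantiles(..., n=4) returns the three quartile cut points;
    # the default method is 'exclusive' (an assumption, see lead-in).
    q1, _, q3 = quantiles(scores, n=4)
    return median(scores), q1, q3


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of
    comparisons and cap the result at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical totals from five attempts per language (NOT the study's data).
attempt_scores = {
    "lang_a": [60, 62, 65, 66, 70],
    "lang_b": [80, 82, 85, 88, 90],
}

summary = {lang: median_iqr(s) for lang, s in attempt_scores.items()}
for lang, (med, q1, q3) in summary.items():
    print(f"{lang}: median {med} (IQR {q1}-{q3})")

# Hypothetical raw p-values from three pairwise comparisons.
adjusted = bonferroni([0.01, 0.04, 0.5])
print(adjusted)
```

The quartile convention matters with only five values per language, so any reproduction of the reported ranges would need to confirm which method the authors' statistics software used.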
Pages: 319-329
Page count: 11