Response accuracy of GPT-4 across languages: insights from an expert-level diagnostic radiology examination in Japan

Cited by: 1

Authors:

Harigai, Ayaka [1,2]

Toyama, Yoshitaka [1]

Nagano, Mitsutoshi [3]

Abe, Mirei [1,2]

Kawabata, Masahiro [2]

Li, Li [4]

Yamamura, Jin [5]

Takase, Kei [2]

Affiliations:

[1] Tohoku Univ Hosp, Dept Diagnost Radiol, 1-1 Seiryo Machi,Aoba Ku, Sendai 9808575, Japan

[2] Tohoku Univ, Dept Diagnost Radiol, Grad Sch Med, 1-1 Seiryo Machi,Aoba Ku, Sendai, Miyagi, Japan

[3] Tohoku Univ Hosp, Grad Med Educ Ctr, 1-1 Seiryo Machi,Aoba Ku, Sendai, Miyagi, Japan

[4] Tohoku Med & Pharmaceut Univ, Div Radiol, 1-15-1 Fukumuro,Miyagino Ku, Sendai, Miyagi, Japan

[5] Univ Med Ctr Hamburg Eppendorf, Ctr Radiol & Endoscopy, Dept Diagnost & Intervent Radiol & Nucl Med, Martinistr 52, D-20246 Hamburg, Germany

Source:

JAPANESE JOURNAL OF RADIOLOGY | 2025 / Vol. 43 / Issue 02

Keywords:

GPT-4; Prompt; Radiology board examination; Linguistic variation; Translation quality;

DOI:

10.1007/s11604-024-01673-6

Chinese Library Classification:

R8 [Special medicine]; R445 [Diagnostic imaging];

Subject classification codes:

1002 ; 100207 ; 1009 ;

Abstract:

Purpose: This study aims to investigate the effects of language selection and translation quality on Generative Pre-trained Transformer-4 (GPT-4)'s response accuracy to expert-level diagnostic radiology questions.

Materials and methods: We analyzed 146 diagnostic radiology questions from the Japan Radiology Board Examination (2020-2022), with consensus answers provided by two board-certified radiologists. The questions, originally in Japanese, were translated into English by GPT-4 and DeepL and into German and Chinese by GPT-4. Responses were generated by GPT-4 five times per question set per language. Response accuracy was compared between languages using one-way ANOVA with Bonferroni correction or the Mann-Whitney U test. Scores on selected English questions translated by a professional service and GPT-4 were also compared. The impact of translation quality on GPT-4's performance was assessed by linear regression analysis.

Results: The median scores (interquartile range) for the 146 questions were 70 (68-72) (Japanese), 89 (84.5-95.5) (GPT-4 English), 64 (55.5-67) (Chinese), and 56 (46.5-67.5) (German). Significant differences were found between Japanese and English (p = 0.002) and between Japanese and German (p = 0.022). The counts of correct responses across five attempts for each question were significantly associated with the quality of translation into English (GPT-4, DeepL) and German (GPT-4). In a subset of 31 questions where English translations yielded fewer correct responses than Japanese originals, professionally translated questions yielded better scores than those translated by GPT-4 (13 versus 8 points, p = 0.0079).

Conclusion: GPT-4 exhibits higher accuracy when responding to English-translated questions compared to original Japanese questions, a trend not observed with German or Chinese translations. Accuracy improves with higher-quality English translations, underscoring the importance of high-quality translations in improving GPT-4's response accuracy to diagnostic radiology questions in non-English languages and aiding non-native English speakers in obtaining accurate answers from large language models.

Pages: 319-329

Page count: 11

