Response accuracy of GPT-4 across languages: insights from an expert-level diagnostic radiology examination in Japan

Cited by: 1
Authors
Harigai, Ayaka [1, 2]
Toyama, Yoshitaka [1 ]
Nagano, Mitsutoshi [3 ]
Abe, Mirei [1, 2]
Kawabata, Masahiro [2 ]
Li, Li [4 ]
Yamamura, Jin [5 ]
Takase, Kei [2 ]
Affiliations
[1] Tohoku Univ Hosp, Dept Diagnost Radiol, 1-1 Seiryo Machi,Aoba Ku, Sendai 9808575, Japan
[2] Tohoku Univ, Dept Diagnost Radiol, Grad Sch Med, 1-1 Seiryo Machi,Aoba Ku, Sendai, Miyagi, Japan
[3] Tohoku Univ Hosp, Grad Med Educ Ctr, 1-1 Seiryo Machi,Aoba Ku, Sendai, Miyagi, Japan
[4] Tohoku Med & Pharmaceut Univ, Div Radiol, 1-15-1 Fukumuro,Miyagino Ku, Sendai, Miyagi, Japan
[5] Univ Med Ctr Hamburg Eppendorf, Ctr Radiol & Endoscopy, Dept Diagnost & Intervent Radiol & Nucl Med, Martinistr 52, D-20246 Hamburg, Germany
Keywords
GPT-4; Prompt; Radiology board examination; Linguistic variation; Translation quality
DOI
10.1007/s11604-024-01673-6
Chinese Library Classification (CLC) number
R8 [Special medicine]; R445 [Diagnostic imaging]
Subject classification code
1002; 100207; 1009
Abstract
Purpose: This study aims to investigate the effects of language selection and translation quality on Generative Pre-trained Transformer-4 (GPT-4)'s response accuracy to expert-level diagnostic radiology questions.
Materials and methods: We analyzed 146 diagnostic radiology questions from the Japan Radiology Board Examination (2020-2022), with consensus answers provided by two board-certified radiologists. The questions, originally in Japanese, were translated into English by GPT-4 and DeepL and into German and Chinese by GPT-4. Responses were generated by GPT-4 five times per question set per language. Response accuracy was compared between languages using one-way ANOVA with Bonferroni correction or the Mann-Whitney U test. Scores on selected English questions translated by a professional service and GPT-4 were also compared. The impact of translation quality on GPT-4's performance was assessed by linear regression analysis.
Results: The median scores (interquartile range) for the 146 questions were 70 (68-72) (Japanese), 89 (84.5-95.5) (GPT-4 English), 64 (55.5-67) (Chinese), and 56 (46.5-67.5) (German). Significant differences were found between Japanese and English (p = 0.002) and between Japanese and German (p = 0.022). The counts of correct responses across five attempts for each question were significantly associated with the quality of translation into English (GPT-4, DeepL) and German (GPT-4). In a subset of 31 questions where English translations yielded fewer correct responses than Japanese originals, professionally translated questions yielded better scores than those translated by GPT-4 (13 versus 8 points, p = 0.0079).
Conclusion: GPT-4 exhibits higher accuracy when responding to English-translated questions compared to original Japanese questions, a trend not observed with German or Chinese translations. Accuracy improves with higher-quality English translations, underscoring the importance of high-quality translations in improving GPT-4's response accuracy to diagnostic radiology questions in non-English languages and aiding non-native English speakers in obtaining accurate answers from large language models.
Pages: 319-329
Page count: 11