Benchmarking open-source large language models on Portuguese Revalida multiple-choice questions

Cited by: 0

Authors:

Severino, Joao Victor Bruneti ^{[1,2]}

de Paula, Pedro Angelo Basei

Berger, Matheus Nespolo ^{[1]}

Loures, Filipe Silveira ^{[3]}

Todeschini, Solano Amadori ^{[3]}

Roeder, Eduardo Augusto ^{[1,3]}

Veiga, Maria Han ^{[4]}

Guedes, Murilo ^{[2]}

Marques, Gustavo Lenci ^{[1,2,3]}

Affiliations:

[1] Univ Fed Parana, Curitiba, Brazil

[2] Pontificia Univ Catolica Parana, Curitiba, Brazil

[3] Voa Hlth, Belo Horizonte, Brazil

[4] Ohio State Univ, Math, Columbus, OH USA

Source:

BMJ HEALTH & CARE INFORMATICS | 2025 / Vol. 32 / Issue 01

Keywords:

Artificial intelligence; Health Equity; Machine Learning; Medical Informatics Applications; Universal Health Care;

DOI:

10.1136/bmjhci-2024-101195

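Chinese Library Classification (CLC) number:

R19 [Health care organizations and services (health services administration)];

Discipline classification number: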

Abstract:

Objective: The study aimed to evaluate the top large language models (LLMs) on validated medical knowledge tests in Portuguese.

Methods: This study compared 31 LLMs on the Brazilian national medical examination (Revalida). It compared the performance of 23 open-source and 8 proprietary models across 399 multiple-choice questions.

Results: Among the smaller models, Llama 3 8B had the highest success rate, 53.9%, while the medium-sized Mixtral 8x7B attained 63.7%. The larger Llama 3 70B reached 77.5%. Among the proprietary models, GPT-4o and Claude Opus were the most accurate, scoring 86.8% and 83.8%, respectively.

Conclusions: Ten of the 31 LLMs exceeded human-level performance on the Revalida benchmark, while 9 failed to provide coherent answers to the task. Larger models performed better overall. However, certain medium-sized LLMs outperformed some of the larger ones.


Pages: 4

