Adherence of Studies on Large Language Models for Medical Applications Published in Leading Medical Journals According to the MI-CLEAR-LLM Checklist


Authors:

Ko, Ji Su [1,2]

Heo, Hwon [3]

Suh, Chong Hyun [1,2]

Yi, Jeho [4]

Shim, Woo Hyun [1,2,3]

Affiliations:

[1] Univ Ulsan, Asan Med Ctr, Coll Med, Dept Radiol, 88 Olymp Ro 43 Gil, Seoul 05505, South Korea

[2] Univ Ulsan, Res Inst Radiol, Coll Med, Asan Med Ctr, 88 Olymp Ro 43 Gil, Seoul 05505, South Korea

[3] Univ Ulsan, Asan Med Inst Convergence Sci & Technol, Asan Med Ctr, Dept Med Sci, Coll Med, Seoul, South Korea

[4] Univ Ulsan, Coll Med, Asan Med Lib, Seoul, South Korea

Source:

KOREAN JOURNAL OF RADIOLOGY | 2025 / Vol. 26 / Issue 04

Keywords:

Large language model; Large multimodal model; Chatbot; Generative; Artificial intelligence; Deep learning; Reporting; Guideline; Checklist; Standard; Adherence; Quality; ARTIFICIAL-INTELLIGENCE; REPORTING GUIDELINES;

DOI:

10.3348/kjr.2024.1161

Chinese Library Classification (CLC) number:

R8 [Special Medicine]; R445 [Diagnostic Imaging];

Discipline classification code:

1002 ; 100207 ; 1009 ;

Abstract:

Objective: To evaluate the adherence of large language model (LLM)-based healthcare research to the Minimum Reporting Items for Clear Evaluation of Accuracy Reports of Large Language Models in Healthcare (MI-CLEAR-LLM) checklist, a framework designed to enhance the transparency and reproducibility of studies on the accuracy of LLMs for medical applications.

Materials and Methods: A systematic PubMed search was conducted to identify articles on LLM performance published in high-ranking clinical medicine journals (the top 10% in each of the 59 specialties according to the 2023 Journal Impact Factor) from November 30, 2022, through June 25, 2024. Data on the six MI-CLEAR-LLM checklist items (1) identification and specification of the LLM used, 2) stochasticity handling, 3) prompt wording and syntax, 4) prompt structuring, 5) prompt testing and optimization, and 6) independence of the test data) were independently extracted by two reviewers, and adherence was calculated for each item.

Results: Of 159 studies, 100% (159/159) reported the name of the LLM, 96.9% (154/159) reported the version, 91.8% documented access to web-based information, and 50.9% (81/159) provided the date of the query attempts. Clear documentation regarding stochasticity management was provided in 15.1% (24/159) of the studies. Regarding prompt details, 49.1% (78/159) provided exact prompt wording and syntax, but only 34.0% (54/159) documented prompt-structuring practices. While 46.5% (74/159) of the studies detailed prompt testing, only 15.7% (25/159) explained the rationale for specific word choices. Test data independence was reported for only 13.2% (21/159) of the studies, and 56.6% (43/76) provided URLs for internet-sourced test data.

Conclusion: Although basic LLM identification details were relatively well reported, other key aspects, including stochasticity, prompts, and test data, were frequently underreported. Enhancing adherence to the MI-CLEAR-LLM checklist will allow LLM research to achieve greater transparency and will foster more credible and reliable future studies.
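The per-item adherence figures in the Results are simple proportions of studies meeting a reporting item. As a minimal sketch, the percentages can be recomputed from the counts stated in the abstract; the function name `adherence_pct` and the dictionary labels are hypothetical and chosen only for illustration, and this is not the authors' analysis code.

```python
# Recompute adherence percentages from the counts quoted in the abstract.
# Labels and function names are illustrative assumptions, not the authors' code.

def adherence_pct(reported: int, total: int) -> float:
    """Return the share of studies reporting an item, as a percentage rounded to 1 decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * reported / total, 1)

# (reported, total) pairs taken directly from the abstract.
counts = {
    "LLM name": (159, 159),
    "LLM version": (154, 159),
    "Date of query attempts": (81, 159),
    "Stochasticity management": (24, 159),
    "Exact prompt wording and syntax": (78, 159),
    "Prompt structuring": (54, 159),
    "Prompt testing": (74, 159),
    "Rationale for word choices": (25, 159),
    "Test data independence": (21, 159),
    "URLs for internet-sourced test data": (43, 76),
}

for item, (reported, total) in counts.items():
    print(f"{item}: {adherence_pct(reported, total)}% ({reported}/{total})")
```

Note that the last item uses a denominator of 76 rather than 159, as given in the abstract.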

Pages: 304-312

Number of pages: 9
