Adherence of Studies on Large Language Models for Medical Applications Published in Leading Medical Journals According to the MI-CLEAR-LLM Checklist

Cited: 0
Authors
Ko, Ji Su [1 ,2 ]
Heo, Hwon [3 ]
Suh, Chong Hyun [1 ,2 ]
Yi, Jeho [4 ]
Shim, Woo Hyun [1 ,2 ,3 ]
Affiliations
[1] Univ Ulsan, Asan Med Ctr, Coll Med, Dept Radiol, 88 Olymp Ro 43 Gil, Seoul 05505, South Korea
[2] Univ Ulsan, Res Inst Radiol, Coll Med, Asan Med Ctr, 88 Olymp Ro 43 Gil, Seoul 05505, South Korea
[3] Univ Ulsan, Asan Med Inst Convergence Sci & Technol, Asan Med Ctr, Dept Med Sci,Coll Med, Seoul, South Korea
[4] Univ Ulsan, Coll Med, Asan Med Lib, Seoul, South Korea
Keywords
Large language model; Large multimodal model; Chatbot; Generative; Artificial intelligence; Deep learning; Reporting; Guideline; Checklist; Standard; Adherence; Quality; ARTIFICIAL-INTELLIGENCE; REPORTING GUIDELINES;
DOI
10.3348/kjr.2024.1161
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline Classification Codes
1002; 100207; 1009
Abstract
Objective: To evaluate the adherence of large language model (LLM)-based healthcare research to the Minimum Reporting Items for Clear Evaluation of Accuracy Reports of Large Language Models in Healthcare (MI-CLEAR-LLM) checklist, a framework designed to enhance the transparency and reproducibility of studies on the accuracy of LLMs for medical applications.
Materials and Methods: A systematic PubMed search was conducted to identify articles on LLM performance published in high-ranking clinical medicine journals (the top 10% in each of the 59 specialties according to the 2023 Journal Impact Factor) from November 30, 2022, through June 25, 2024. Data on the six MI-CLEAR-LLM checklist items, namely 1) identification and specification of the LLM used, 2) stochasticity handling, 3) prompt wording and syntax, 4) prompt structuring, 5) prompt testing and optimization, and 6) independence of the test data, were independently extracted by two reviewers, and adherence was calculated for each item.
Results: Of 159 studies, 100% (159/159) reported the name of the LLM, 96.9% (154/159) reported the version, 91.8% documented access to web-based information, and 50.9% (81/159) provided the date of the query attempts. Clear documentation regarding stochasticity management was provided in 15.1% (24/159) of the studies. Regarding prompt details, 49.1% (78/159) provided the exact prompt wording and syntax, but only 34.0% (54/159) documented prompt-structuring practices. While 46.5% (74/159) of the studies detailed prompt testing, only 15.7% (25/159) explained the rationale for specific word choices. Test data independence was reported for only 13.2% (21/159) of the studies, and 56.6% (43/76) provided URLs for internet-sourced test data.
Conclusion: Although basic LLM identification details were relatively well reported, other key aspects, including stochasticity, prompts, and test data, were frequently underreported. Enhancing adherence to the MI-CLEAR-LLM checklist will allow LLM research to achieve greater transparency and will foster more credible and reliable future studies.
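The per-item adherence figures in the Results section are simple reporting proportions (studies reporting an item divided by studies reviewed). A minimal sketch of that calculation, using counts taken from the abstract (the short item labels here are abbreviations for illustration, not the checklist's official wording):

```python
# Per-item adherence = studies reporting the item / total studies reviewed.
# Counts below are from the abstract; labels are abbreviated for illustration.
reported_counts = {
    "LLM name": 159,
    "Stochasticity handling": 24,
    "Exact prompt wording": 78,
    "Test data independence": 21,
}
total_studies = 159

adherence = {
    item: round(100 * n / total_studies, 1)
    for item, n in reported_counts.items()
}
print(adherence)
# {'LLM name': 100.0, 'Stochasticity handling': 15.1,
#  'Exact prompt wording': 49.1, 'Test data independence': 13.2}
```

These reproduce the 100%, 15.1%, 49.1%, and 13.2% figures reported in the Results.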
Pages: 304-312
Page count: 9