Evaluation of large language models performance against humans for summarizing MRI knee radiology reports: A feasibility study

Cited by: 4
Authors
Lopez-Ubeda, Pilar [1 ]
Martin-Noguerol, Teodoro [2 ]
Diaz-Angulo, Carolina [3 ]
Luna, Antonio [2 ]
Affiliations
[1] Nat Language Proc Unit, Hlth Time, Jaen, Spain
[2] Hlth Time, MRI Unit, Radiol Dept, Jaen, Spain
[3] Hlth Time, MRI Unit, Radiol Dept, Jaen, Spain
Keywords
Radiology report summarization; Natural Language Processing; Large Language Model; Knee MRI reports; Human expert evaluation;
DOI
10.1016/j.ijmedinf.2024.105443
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Objectives: This study addresses the critical need for accurate summarization in radiology by comparing various Large Language Model (LLM)-based approaches for automatic summary generation. With the increasing volume of patient information, accurately and concisely conveying radiological findings becomes crucial for effective clinical decision-making. Minor inaccuracies in summaries can lead to significant consequences, highlighting the need for reliable automated summarization tools. Methods: We employed two language models, Text-to-Text Transfer Transformer (T5) and Bidirectional and Auto-Regressive Transformers (BART), in both fine-tuned and zero-shot learning scenarios, and compared them with a Recurrent Neural Network (RNN). Additionally, we conducted a comparative analysis of 100 MRI report summaries, using expert human judgment and criteria such as coherence, relevance, fluency, and consistency, to evaluate the models against the original radiologist summaries. To facilitate this, we compiled a dataset of 15,508 retrospective knee Magnetic Resonance Imaging (MRI) reports from our Radiology Information System (RIS), focusing on the findings section to predict the radiologist's summary. Results: The fine-tuned models outperformed the RNN and showed superior performance over the zero-shot variants. Specifically, the T5 model achieved a ROUGE-L score of 0.638. In the radiologist readers' study, the summaries produced by this model were found to be very similar to those produced by a radiologist, with about 70% similarity in fluency and consistency between the T5-generated summaries and the original ones. Conclusions: Technological advances, especially in NLP and LLMs, hold great promise for improving and streamlining the summarization of radiological findings, thus providing valuable assistance to radiologists in their work.
Pages: 10
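
Illustrative sketch (not from the paper): the abstract describes fine-tuned and zero-shot T5/BART models that condense the findings section of a knee MRI report into a short summary, scored with ROUGE-L against the radiologist's original summary. A minimal Python sketch of that kind of pipeline, using the Hugging Face transformers and evaluate libraries, is given below; the checkpoint name, example findings text, and reference summary are hypothetical placeholders, not the authors' data or code.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import evaluate

# Assumption: a generic T5 checkpoint stands in for the paper's model,
# which was fine-tuned on 15,508 findings->summary pairs not released here.
model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical findings section; T5 expects a task prefix such as "summarize:".
findings = ("summarize: Grade II signal in the posterior horn of the medial "
            "meniscus. Anterior and posterior cruciate ligaments are intact. "
            "Small joint effusion.")
inputs = tokenizer(findings, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
generated_summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# ROUGE-L against a hypothetical radiologist-written reference summary.
rouge = evaluate.load("rouge")
reference = ("Degenerative signal in the medial meniscus without tear; "
             "cruciate ligaments intact; small joint effusion.")
scores = rouge.compute(predictions=[generated_summary], references=[reference])
print(generated_summary)
print("ROUGE-L:", scores["rougeL"])

The human-evaluation arm of the study (radiologists rating coherence, relevance, fluency, and consistency) has no automated equivalent and is not sketched here.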