Evaluation of large language models performance against humans for summarizing MRI knee radiology reports: A feasibility study

Cited by: 4
Authors
Lopez-Ubeda, Pilar [1 ]
Martin-Noguerol, Teodoro [2 ]
Diaz-Angulo, Carolina [3 ]
Luna, Antonio [2 ]
Affiliations
[1] Nat Language Proc Unit, Hlth Time, Jaen, Spain
[2] Hlth Time, MRI Unit, Radiol Dept, Jaen, Spain
[3] Hlth Time, MRI Unit, Radiol Dept, Jaen, Spain
Keywords
Radiology report summarization; Natural Language Processing; Large Language Model; Knee MRI reports; Human expert evaluation;
DOI
10.1016/j.ijmedinf.2024.105443
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Objectives: This study addresses the critical need for accurate summarization in radiology by comparing several Large Language Model (LLM)-based approaches for automatic summary generation. With the increasing volume of patient information, accurately and concisely conveying radiological findings is crucial for effective clinical decision-making. Minor inaccuracies in summaries can have significant consequences, highlighting the need for reliable automated summarization tools.
Methods: We employed two language models, Text-to-Text Transfer Transformer (T5) and Bidirectional and Auto-Regressive Transformers (BART), in both fine-tuned and zero-shot learning scenarios, and compared them with a Recurrent Neural Network (RNN). Additionally, we conducted a comparative analysis of 100 MRI report summaries, using expert human judgment and criteria such as coherence, relevance, fluency, and consistency, to evaluate the models against the original radiologist summaries. To facilitate this, we compiled a dataset of 15,508 retrospective knee Magnetic Resonance Imaging (MRI) reports from our Radiology Information System (RIS), focusing on the findings section to predict the radiologist's summary.
Results: The fine-tuned models outperformed both the RNN and their zero-shot counterparts. Specifically, the fine-tuned T5 model achieved a ROUGE-L score of 0.638. In the radiologist readers' study, the summaries produced by this model were judged very similar to those written by a radiologist, with about 70% similarity in fluency and consistency between the T5-generated summaries and the original ones.
Conclusions: Technological advances, especially in NLP and LLMs, hold great promise for improving and streamlining the summarization of radiological findings, thus providing valuable assistance to radiologists in their work.
Pages: 10
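As a rough, non-authoritative illustration of the zero-shot workflow described in the abstract above, the Python sketch below summarizes a hypothetical knee MRI findings section with a publicly available BART summarization checkpoint and scores it against a hypothetical radiologist impression using ROUGE-L. The checkpoint name, example texts, and generation parameters are illustrative assumptions; the study's fine-tuned T5/BART models, its 15,508-report in-house dataset, and its human reader evaluation are not reproduced here.

```python
from transformers import pipeline        # pip install transformers torch
from rouge_score import rouge_scorer     # pip install rouge-score

# A public summarization checkpoint stands in for the study's fine-tuned
# T5/BART models, which were trained on a private set of knee MRI reports.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Hypothetical findings section and radiologist impression (not study data).
findings = (
    "Grade II signal change in the posterior horn of the medial meniscus. "
    "Moderate joint effusion. Anterior cruciate ligament is intact. "
    "Mild chondral thinning in the patellofemoral compartment."
)
reference_summary = (
    "Degenerative signal in the medial meniscus posterior horn with moderate "
    "effusion; cruciate ligaments intact."
)

# Zero-shot generation of a summary (impression) from the findings text.
candidate = summarizer(findings, max_length=60, min_length=10, do_sample=False)
candidate_summary = candidate[0]["summary_text"]

# ROUGE-L F-measure, the overlap metric reported in the study.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score(reference_summary, candidate_summary)

print("Generated summary:", candidate_summary)
print(f"ROUGE-L F1: {score['rougeL'].fmeasure:.3f}")
```

A fine-tuned variant would follow the same inference path after training a T5 or BART checkpoint on paired findings/impression sections; the study reports a ROUGE-L of 0.638 for its fine-tuned T5 model on in-house data.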