Evaluation of large language models performance against humans for summarizing MRI knee radiology reports: A feasibility study

Cited by: 4

Authors:

Lopez-Ubeda, Pilar [1]

Martin-Noguerol, Teodoro [2]

Diaz-Angulo, Carolina [3]

Luna, Antonio [2]

Affiliations:

[1] Nat Language Proc Unit, Hlth Time, Jaen, Spain

[2] Hlth Time, MRI Unit, Radiol Dept, Jaen, Spain

[3] Hlth Time, MRI Unit, Radiol Dept, Jaen, Spain

Source:

INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS | 2024 / Vol. 187

Keywords:

Radiology report summarization; Natural Language Processing; Large Language Model; Knee MRI reports; Human expert evaluation;

DOI:

10.1016/j.ijmedinf.2024.105443

Chinese Library Classification (CLC):

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Objectives: This study addresses the critical need for accurate summarization in radiology by comparing various Large Language Model (LLM)-based approaches for automatic summary generation. With the increasing volume of patient information, accurately and concisely conveying radiological findings becomes crucial for effective clinical decision-making. Minor inaccuracies in summaries can lead to significant consequences, highlighting the need for reliable automated summarization tools. Methods: We employed two language models - Text-to-Text Transfer Transformer (T5) and Bidirectional and Auto-Regressive Transformers (BART) - in both fine-tuned and zero-shot learning scenarios and compared them with a Recurrent Neural Network (RNN). Additionally, we conducted a comparative analysis of 100 MRI report summaries, using expert human judgment and criteria such as coherence, relevance, fluency, and consistency, to evaluate the models against the original radiologist summaries. To facilitate this, we compiled a dataset of 15,508 retrospective knee Magnetic Resonance Imaging (MRI) reports from our Radiology Information System (RIS), focusing on the findings section to predict the radiologist's summary. Results: The fine-tuned models outperform the neural network and show superior performance in the zero-shot variant. Specifically, the T5 model achieved a ROUGE-L score of 0.638. Based on the radiologist readers' study, the summaries produced by this model were found to be very similar to those produced by a radiologist, with about 70% similarity in fluency and consistency between the T5-generated summaries and the original ones. Conclusions: Technological advances, especially in NLP and LLM, hold great promise for improving and streamlining the summarization of radiological findings, thus providing valuable assistance to radiologists in their work.
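Note on the ROUGE-L metric reported in the abstract: ROUGE-L scores a generated summary by the longest common subsequence (LCS) of tokens it shares with the reference summary. The sketch below is a minimal illustration only. It assumes lowercase whitespace tokenization and the balanced F1 variant of the score; the abstract does not say which tokenization, stemming, or F-measure weighting the authors used, and the function names `lcs_length` and `rouge_l_f1` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between a reference summary and a candidate summary.

    Assumption: tokens are lowercase whitespace-separated words, and
    precision and recall are weighted equally (F1).
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up strings for illustration; not taken from the study's dataset.
    reference = "complex tear of the medial meniscus"
    candidate = "tear of the medial meniscus"
    print(round(rouge_l_f1(reference, candidate), 3))
```

A corpus-level figure such as the one in the abstract would typically be an average of per-report scores, although the record does not state how the authors aggregated them.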


Pages: 10
