Evaluating the performance of Generative Pre-trained Transformer-4 (GPT-4) in standardizing radiology reports

Cited by: 42
Authors
Hasani, Amir M. [1]
Singh, Shiva [2]
Zahergivar, Aryan [2]
Ryan, Beth [3]
Nethala, Daniel [3]
Bravomontenegro, Gabriela [3]
Mendhiratta, Neil [3]
Ball, Mark [3]
Farhadi, Faraz [2]
Malayeri, Ashkan [2]
Affiliations
[1] NHLBI, Lab Translat Res, NIH, Bethesda, MD USA
[2] NIH, Radiol & Imaging Sci Dept, Clin Ctr, Bethesda, MD 20892 USA
[3] NCI, Urol Oncol Branch, NIH, Bethesda, MD USA
Funding
National Institutes of Health (USA)
Keywords
Artificial intelligence; Natural language processing; Digital health; Machine learning
DOI
10.1007/s00330-023-10384-x
Chinese Library Classification
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline classification codes
1002; 100207; 1009
Abstract
Objective: Radiology reporting is an essential component of clinical diagnosis and decision-making. With the advent of advanced artificial intelligence (AI) models like GPT-4 (Generative Pre-trained Transformer 4), there is growing interest in evaluating their potential for optimizing or generating radiology reports. This study aimed to compare the quality and content of radiologist-generated and GPT-4 AI-generated radiology reports.

Methods: A comparative study design was employed in the study, where a total of 100 anonymized radiology reports were randomly selected and analyzed. Each report was processed by GPT-4, resulting in the generation of a corresponding AI-generated report. Quantitative and qualitative analysis techniques were utilized to assess similarities and differences between the two sets of reports.

Results: The AI-generated reports showed comparable quality to radiologist-generated reports in most categories. Significant differences were observed in clarity (p = 0.027), ease of understanding (p = 0.023), and structure (p = 0.050), favoring the AI-generated reports. AI-generated reports were more concise, with 34.53 fewer words and 174.22 fewer characters on average, but had greater variability in sentence length. Content similarity was high, with an average Cosine Similarity of 0.85, Sequence Matcher Similarity of 0.52, BLEU Score of 0.5008, and BERTScore F1 of 0.8775.

Conclusion: The results of this proof-of-concept study suggest that GPT-4 can be a reliable tool for generating standardized radiology reports, offering potential benefits such as improved efficiency, better communication, and simplified data extraction and analysis. However, limitations and ethical implications must be addressed to ensure the safe and effective implementation of this technology in clinical practice.

Clinical relevance statement: The findings of this study suggest that GPT-4 (Generative Pre-trained Transformer 4), an advanced AI model, has the potential to significantly contribute to the standardization and optimization of radiology reporting, offering improved efficiency and communication in clinical practice.

Key Points:
• Large language model-generated radiology reports exhibited high content similarity and moderate structural resemblance to radiologist-generated reports.
• Performance metrics highlighted the strong matching of word selection and order, as well as high semantic similarity between AI and radiologist-generated reports.
• Large language model demonstrated potential for generating standardized radiology reports, improving efficiency and communication in clinical settings.
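The abstract names its similarity metrics without describing how they were computed. As a rough illustration only, the sketch below shows one common way to obtain a cosine similarity, a sequence-matcher ratio, and word and character count differences for a pair of reports using the Python standard library. The word-count vectorization, the tokenizer, and all function names are assumptions made here, not the authors' method; BLEU and BERTScore are left out because they depend on third-party libraries.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two texts.

    Assumption: plain word counts; the study may have used another
    vectorization (for example TF-IDF or embeddings).
    """
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sequence_similarity(a, b):
    """Character-level ratio from difflib.SequenceMatcher (order-sensitive)."""
    return SequenceMatcher(None, a, b).ratio()


def length_difference(original, generated):
    """Return (word difference, character difference), original minus generated."""
    words = len(tokenize(original)) - len(tokenize(generated))
    chars = len(original) - len(generated)
    return words, chars


if __name__ == "__main__":
    # Hypothetical report fragments, invented for illustration only.
    original = "The liver is normal in size. No focal lesion is seen in the liver."
    generated = "Liver: normal size, no focal lesion."
    print("cosine:", round(cosine_similarity(original, generated), 3))
    print("sequence:", round(sequence_similarity(original, generated), 3))
    print("length difference:", length_difference(original, generated))
```

A bag-of-words cosine ignores word order while the sequence-matcher ratio does not, which is one plausible reading of why the abstract reports a high cosine similarity (0.85) alongside a lower sequence-matcher similarity (0.52).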
Pages: 3566-3574
Page count: 9