ChatGPT yields low accuracy in determining LI-RADS scores based on free-text and structured radiology reports in German language

Cited by: 9
Authors
Fervers, Philipp [1, 2]
Hahnfeldt, Robert [1, 2]
Kottlors, Jonathan [1, 2]
Wagner, Anton [1, 2]
Maintz, David [1, 2]
dos Santos, Daniel Pinto [1, 2, 3]
Lennartz, Simon [1, 2]
Persigehl, Thorsten [1, 2]
Affiliations
[1] Univ Cologne, Fac Med, Dept Diagnost & Intervent Radiol, Cologne, Germany
[2] Univ Hosp Cologne, Cologne, Germany
[3] Goethe Univ Frankfurt Main, Univ Hosp Frankfurt, Dept Diagnost & Intervent Radiol, Frankfurt, Germany
Source
FRONTIERS IN RADIOLOGY | 2024 / Volume 4
Keywords
diagnostic imaging; neoplasms; liver; diagnosis; LI-RADS (liver imaging reporting and data system); MRI;
DOI
10.3389/fradi.2024.1390774
Chinese Library Classification number
R8 [Special Medicine]; R445 [Diagnostic Imaging];
Subject classification code
1002 ; 100207 ; 1009 ;
Abstract
Background To investigate the feasibility of using the large language model (LLM) ChatGPT to classify liver lesions according to the Liver Imaging Reporting and Data System (LI-RADS) based on MRI reports, and to compare classification performance on structured vs. unstructured reports.
Methods LI-RADS-classifiable liver lesions were included from German-language structured and unstructured MRI reports; the minimum inclusion requirements were reported size, location, and arterial phase contrast enhancement. The findings sections of the reports were passed to ChatGPT (GPT-3.5), which was instructed to determine a LI-RADS score for each classifiable liver lesion. Ground truth was established by two radiologists in consensus. Agreement between ground truth and ChatGPT was assessed with Cohen's kappa. Test-retest reliability was assessed with the intraclass correlation coefficient (ICC) by passing a subset of n = 50 lesions to ChatGPT five times.
Results 205 MRIs from 150 patients were included. The accuracy of ChatGPT at determining LI-RADS categories was poor (53% on unstructured and 44% on structured reports). Compared with structured reports, free-text reports showed higher agreement with the ground truth (κ = 0.51 vs. κ = 0.44), a lower mean absolute error in LI-RADS scores (0.5 ± 0.5 vs. 0.6 ± 0.7, p < 0.05), and higher test-retest reliability (ICC = 0.81 vs. 0.50), although structured reports contained the minimum required imaging features significantly more frequently (chi-square test, p < 0.05).
Conclusions ChatGPT attained only low accuracy when asked to determine LI-RADS scores from liver imaging reports. The superior accuracy and consistency on free-text reports might relate to ChatGPT's training process.
Clinical relevance statement Our study indicates both the necessity of optimization of LLMs for structured clinical data input and the potential of LLMs for creating machine-readable labels based on large free-text radiological databases.
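The agreement measures named in the abstract (accuracy, mean absolute error, Cohen's kappa) can be illustrated with a minimal sketch. The LI-RADS values below are hypothetical and not taken from the study, the function names are illustrative, and the kappa shown is the unweighted form; the record does not state whether the authors used a weighted variant.

```python
from collections import Counter


def accuracy(truth, pred):
    """Fraction of lesions where the predicted category equals the ground truth."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def mean_absolute_error(truth, pred):
    """Average absolute difference between predicted and true LI-RADS scores."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


def cohens_kappa(truth, pred):
    """Unweighted Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(truth)
    p_observed = accuracy(truth, pred)
    truth_counts = Counter(truth)
    pred_counts = Counter(pred)
    # Chance agreement: product of the two raters' marginal proportions per category.
    p_expected = sum(truth_counts[c] * pred_counts[c] for c in truth_counts) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical LI-RADS categories for eight lesions (illustration only).
ground_truth = [3, 4, 5, 5, 2, 3, 4, 5]
model_output = [3, 4, 4, 5, 3, 3, 5, 5]

print(f"accuracy = {accuracy(ground_truth, model_output):.3f}")
print(f"MAE      = {mean_absolute_error(ground_truth, model_output):.3f}")
print(f"kappa    = {cohens_kappa(ground_truth, model_output):.3f}")
```

Kappa corrects the raw accuracy for agreement expected by chance, which is why it can be noticeably lower than accuracy when a few categories dominate.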
Pages: 9