Potential of ChatGPT and GPT-4 for Data Mining of Free-Text CT Reports on Lung Cancer

Cited by: 131

Authors:

Fink, Matthias A. ^{[1,3,4]}

Bischoff, Arved ^{[1,3,4]}

Fink, Christoph A. ^{[2]}

Moll, Martin ^{[1]}

Kroschke, Jonas ^{[1]}

Dulz, Luca ^{[1,3,4]}

Heussel, Claus Peter ^{[1,3,4,5]}

Kauczor, Hans-Ulrich ^{[1,3,4]}

Weber, Tim F. ^{[1,3,4]}

Affiliations:

[1] Univ Hosp Heidelberg, Clin Diagnost & Intervent Radiol, Neuenheimer Feld 420, D-69120 Heidelberg, Germany

[2] Univ Hosp Heidelberg, Dept Radiat Oncol, Neuenheimer Feld 420, D-69120 Heidelberg, Germany

[3] Translat Lung Res Ctr Heidelberg, Heidelberg, Germany

[4] German Ctr Lung Res, Heidelberg, Germany

[5] Heidelberg Univ, Dept Diagnost & Intervent Radiol Nucl Med, Heidelberg Thorac Clin, Heidelberg, Germany

Source:

RADIOLOGY | 2023 / Vol. 308 / Issue 03

Keywords:

DOI:

10.1148/radiol.231362

Chinese Library Classification (CLC) number:

R8 [Special Medicine]; R445 [Diagnostic Imaging];

Discipline classification code:

1002 ; 100207 ; 1009 ;

Abstract:

Background: The latest large language models (LLMs) solve unseen problems via user-defined text prompts without the need for retraining, offering potentially more efficient information extraction from free-text medical records than manual annotation.
Purpose: To compare the performance of the LLMs ChatGPT and GPT-4 in data mining and labeling oncologic phenotypes from free-text CT reports on lung cancer by using user-defined prompts.
Materials and Methods: This retrospective study included patients who underwent lung cancer follow-up CT between September 2021 and March 2023. A subset of 25 reports was reserved for prompt engineering to instruct the LLMs in extracting lesion diameters, labeling metastatic disease, and assessing oncologic progression. This output was fed into a rule-based natural language processing pipeline to match ground truth annotations from four radiologists and derive performance metrics. The oncologic reasoning of LLMs was rated on a five-point Likert scale for factual correctness and accuracy. The occurrence of confabulations was recorded. Statistical analyses included Wilcoxon signed rank and McNemar tests.
Results: On 424 CT reports from 424 patients (mean age, 65 years ± 11 [SD]; 265 male), GPT-4 outperformed ChatGPT in extracting lesion parameters (98.6% vs 84.0%, P < .001), resulting in 96% correctly mined reports (vs 67% for ChatGPT, P < .001). GPT-4 achieved higher accuracy in identification of metastatic disease (98.1% [95% CI: 97.7, 98.5] vs 90.3% [95% CI: 89.4, 91.0]) and higher performance in generating correct labels for oncologic progression (F1 score, 0.96 [95% CI: 0.94, 0.98] vs 0.91 [95% CI: 0.89, 0.94]) (both P < .001). In oncologic reasoning, GPT-4 had higher Likert scale scores for factual correctness (4.3 vs 3.9) and accuracy (4.4 vs 3.3), with a lower rate of confabulation (1.7% vs 13.7%) than ChatGPT (all P < .001).
Conclusion: When using user-defined prompts, GPT-4 outperformed ChatGPT in extracting oncologic phenotypes from free-text CT reports on lung cancer and demonstrated better oncologic reasoning with fewer confabulations. © RSNA, 2023
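
The abstract names an F1 score and a McNemar test for comparing the two models on the same reports, but it does not describe how they were computed. As a minimal sketch only, the Python below shows one common way to compute a single-label F1 score and an exact two-sided McNemar P value from paired per-report correctness flags. The function names `f1_score` and `mcnemar_exact`, the choice of the exact binomial variant, and any inputs passed to them are illustrative assumptions, not the authors' pipeline.

```python
from math import comb


def f1_score(y_true, y_pred, positive):
    """F1 for one label, treating `positive` as the class of interest (assumed binary view)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar P value from paired correctness flags.

    Only discordant pairs (one model right, the other wrong) carry
    information; under the null hypothesis they split 50/50.
    """
    pairs = list(zip(correct_a, correct_b))
    only_a = sum(a and not b for a, b in pairs)  # model A right, model B wrong
    only_b = sum(b and not a for a, b in pairs)  # model B right, model A wrong
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Both functions expect one entry per report, in the same order for both models, which is what makes the comparison paired. How the study averaged F1 across progression labels is not stated in the abstract, so the single-label form here should be read as a simplification.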

Pages: 9
