Large-Scale Validation of the Feasibility of GPT-4 as a Proofreading Tool for Head CT Reports

Cited by: 6
Authors
Kim, Songsoo [1]
Kim, Donghyun [3]
Shin, Hyun Joo [4, 5, 6]
Lee, Seung Hyun [7]
Kang, Yeseul [4, 5]
Jeong, Sejin [4, 5]
Kim, Jaewoong [1]
Han, Miran [8]
Lee, Seong-Joon [9]
Kim, Joonho [2]
Yum, Jungyon [2]
Han, Changho [1]
Yoon, Dukyong [1, 6, 10]
Affiliations
[1] Yonsei Univ, Coll Med, Dept Biomed Syst Informat, 50-1 Yonsei Ro, Seoul 03722, South Korea
[2] Yonsei Univ, Coll Med, Dept Neurol, 50-1 Yonsei Ro, Seoul 03722, South Korea
[3] Mil Manpower Adm, Cent Draft Phys Examinat Off, Dept Radiol, Daegu, South Korea
[4] Yonsei Univ, Yongin Severance Hosp, Res Inst Radiol Sci, Coll Med,Dept Radiol, Yongin, South Korea
[5] Yonsei Univ, Yongin Severance Hosp, Coll Med, Ctr Clin Imaging Data Sci, Yongin, South Korea
[6] Yonsei Univ, Yongin Severance Hosp, Coll Med, Ctr Digital Hlth, Yongin, South Korea
[7] Yonsei Univ, Coll Med, Gangnam Severance Hosp, Dept Radiol, Seoul, South Korea
[8] Ajou Univ, Ajou Univ Hosp, Sch Med, Dept Radiol, Suwon, South Korea
[9] Ajou Univ, Ajou Univ Hosp, Dept Neurol, Sch Med, Suwon, South Korea
[10] Severance Hosp, Inst Innovat Digital Healthcare, Seoul, South Korea
Keywords
LATERALITY ERRORS; RADIOLOGY;
DOI
10.1148/radiol.240701
Chinese Library Classification (CLC) number
R8 [Special Medicine]; R445 [Diagnostic Imaging];
Discipline classification code
1002; 100207; 1009;
Abstract
Background: The increasing workload of radiologists can lead to burnout and errors in radiology reports. Large language models, such as OpenAI's GPT-4, hold promise as error revision tools for radiology.
Purpose: To test the feasibility of GPT-4 use by determining its error detection, reasoning, and revision performance on head CT reports with varying error types and to validate its clinical utility by comparison with human readers.
Materials and Methods: A total of 10 300 head CT reports were retrospectively extracted from the Medical Information Mart for Intensive Care III public dataset. In experiment 1, among the 300 unaltered reports and 300 versions with applied errors, GPT-4 optimization was initially conducted with 200 reports. The remaining 400 were used for evaluation of error type detection, reasoning, and revision, as well as the analysis of reports with undetected errors. The performance was also compared with that of human readers. In experiment 2, the detection performance of GPT-4 was validated on 10 000 unaltered reports that were deemed error-free by physicians, and an analysis of false-positive results was conducted. A permutation test was conducted to assess differences in performance.
Results: GPT-4 demonstrated commendable performance in error detection (sensitivity, 84% for interpretive error and 89% for factual error), reasoning, and revision. Compared with GPT-4, human readers had worse factual error detection sensitivity (0.33-0.69 vs 0.89; P = .008 for radiologist 4, P < .001 for others) and took longer to review (82-121 seconds vs 16 seconds, P < .001). In 10 000 reports, GPT-4 detected 96 errors, with a low positive predictive value of 0.05, yet 14% of the false-positive responses were potentially beneficial.
Conclusion: GPT-4 effectively detects, reasons, and revises errors in radiology reports. While it shows excellent performance in identifying factual errors, its ability to prioritize clinically significant findings is limited. Recognizing its strengths and limitations, GPT-4 could serve as a feasible tool.
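The abstract states that a permutation test was used to assess differences in performance but does not describe the procedure. As an illustration only, the following is a minimal Python sketch of an unpaired, two-sided permutation test on the difference in detection sensitivity between two readers. The function names, the number of permutations, and the detection outcomes are all hypothetical and are not taken from the study, which may have used a different (for example, paired) design.

```python
import random


def sensitivity(hits):
    """Fraction of error-containing reports that were flagged (1 = detected)."""
    return sum(hits) / len(hits)


def permutation_test(hits_a, hits_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in sensitivity.

    Assumption: outcomes are treated as two independent samples and
    reader labels are shuffled across the pooled outcomes.
    """
    rng = random.Random(seed)
    observed = abs(sensitivity(hits_a) - sensitivity(hits_b))
    pooled = list(hits_a) + list(hits_b)
    n_a = len(hits_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sensitivity(pooled[:n_a]) - sensitivity(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical detection outcomes, not data from the study.
model_hits = [1] * 89 + [0] * 11    # 89 of 100 errors detected
reader_hits = [1] * 50 + [0] * 50   # 50 of 100 errors detected
p_value = permutation_test(model_hits, reader_hits)
```

A small p-value here indicates that a sensitivity gap as large as the observed one rarely arises when reader labels are assigned at random.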
Pages: 9
Related papers
29 items in total
[1]   Leveraging GPT-4 for Post Hoc Transformation of Free-text Radiology Reports into Structured Reporting: A Multilingual Feasibility Study [J].
Adams, Lisa C. ;
Truhn, Daniel ;
Busch, Felix ;
Kader, Avan ;
Niehues, Stefan M. ;
Makowski, Marcus R. ;
Bressem, Keno K. .
RADIOLOGY, 2023, 307 (04)
[2]   Amin KS, 2023, RADIOLOGY, V309, DOI 10.1148/radiol.232561
[3]   [Anonymous], [28] [Online]. Available: https://azure.microsoft.com/en-us/products/microsoft-sentinel.
[4]   GPT-4 in Radiology: Improvements in Advanced Reasoning [J].
Bhayana, Rajesh ;
Bleakney, Robert R. ;
Krishna, Satheesh .
RADIOLOGY, 2023, 307 (05)
[5]   Understanding and Confronting Our Mistakes: The Epidemiology of Error in Radiology and Strategies for Error Reduction [J].
Bruno, Michael A. ;
Walker, Eric A. ;
Abujudeh, Hani H. .
RADIOGRAPHICS, 2015, 35 (06) :1668-1676
[6]   Application of a Domain-specific BERT for Detection of Speech Recognition Errors in Radiology Reports [J].
Chaudhari, Gunvant R. ;
Liu, Tengxiao ;
Chen, Timothy L. ;
Joseph, Gabby B. ;
Vella, Maya ;
Lee, Yoo Jin ;
Vu, Thienkhai H. ;
Seo, Youngho ;
Rauschecker, Andreas M. ;
McCulloch, Charles E. ;
Sohn, Jae Ho .
RADIOLOGY-ARTIFICIAL INTELLIGENCE, 2022, 4 (04)
[7]   Efron B., 1994, INTRO BOOTSTRAP, DOI 10.1201/9780429246593
[8]   Classification of Error in Abdominal Imaging: Pearls and Pitfalls for Radiologists [J].
Egri, Csilla ;
Darras, Kathryn E. ;
Scali, Elena P. ;
Harris, Alison C. .
CANADIAN ASSOCIATION OF RADIOLOGISTS JOURNAL-JOURNAL DE L ASSOCIATION CANADIENNE DES RADIOLOGISTES, 2018, 69 (04) :409-416
[9]   Potential of ChatGPT and GPT-4 for Data Mining of Free-Text CT Reports on Lung Cancer [J].
Fink, Matthias A. ;
Bischoff, Arved ;
Fink, Christoph A. ;
Moll, Martin ;
Kroschke, Jonas ;
Dulz, Luca ;
Heussel, Claus Peter ;
Kauczor, Hans-Ulrich ;
Weber, Tim F. .
RADIOLOGY, 2023, 308 (03)