BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study

Cited by: 37
Authors
Cozzi, Andrea [1 ]
Pinker, Katja [2 ]
Hidber, Andri [3 ]
Zhang, Tianyu [4 ,5 ,6 ]
Bonomo, Luca [1 ]
Lo Gullo, Roberto [2 ,4 ]
Christianson, Blake [2 ]
Curti, Marco [1 ]
Rizzo, Stefania [1 ,3 ]
Del Grande, Filippo [1 ,3 ]
Mann, Ritse M. [4 ,5 ]
Schiaffino, Simone [1 ,3 ]
Affiliations
[1] Ente Osped Cantonale, Imaging Inst Southern Switzerland IIMSI, ViaTesserete 46, CH-6900 Lugano, Switzerland
[2] Mem Sloan Kettering Canc Ctr, Dept Radiol, Breast Imaging Serv, New York, NY USA
[3] Univ Svizzera italiana, Fac Biomed Sci, Lugano, Switzerland
[4] Netherlands Canc Inst, Dept Radiol, Amsterdam, Netherlands
[5] Radboud Univ Nijmegen, Dept Diagnost Imaging, Med Ctr, NL-6500 HB Nijmegen, Netherlands
[6] Maastricht Univ, GROW Res Inst Oncol & Reprod, Maastricht, Netherlands
Keywords
Interobserver variability; Agreement; Reliability
DOI
10.1148/radiol.232133
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Subject Classification Codes
1002; 100207; 1009
Abstract
Background: The performance of publicly available large language models (LLMs) remains unclear for complex clinical tasks. Purpose: To evaluate the agreement between human readers and LLMs for Breast Imaging Reporting and Data System (BI-RADS) categories assigned based on breast imaging reports written in three languages and to assess the impact of discordant category assignments on clinical management. Materials and Methods: This retrospective study included reports for women who underwent MRI, mammography, and/or US for breast cancer screening or diagnostic purposes at three referral centers. Reports with findings categorized as BI-RADS 1-5 and written in Italian, English, or Dutch were collected between January 2000 and October 2023. Board-certified breast radiologists and the LLMs GPT-3.5 and GPT-4 (OpenAI) and Bard, now called Gemini (Google), assigned BI-RADS categories using only the findings described by the original interpreting radiologists. Agreement between human readers and LLMs for BI-RADS categories was assessed using the Gwet agreement coefficient (AC1 value). Frequencies were calculated for changes in BI-RADS category assignments that would affect clinical management (ie, BI-RADS 0 vs BI-RADS 1 or 2 vs BI-RADS 3 vs BI-RADS 4 or 5) and compared using the McNemar test. Results: Across 2400 reports, agreement between the original and reviewing radiologists was almost perfect (AC1 = 0.91), while agreement between the original radiologists and GPT-4, GPT-3.5, and Bard was moderate (AC1 = 0.52, 0.48, and 0.42, respectively).
Across human readers and LLMs, differences were observed in the frequency of BI-RADS category upgrades or downgrades that would result in changed clinical management (118 of 2400 [4.9%] for human readers, 611 of 2400 [25.5%] for Bard, 573 of 2400 [23.9%] for GPT-3.5, and 435 of 2400 [18.1%] for GPT-4; P < .001) and that would negatively impact clinical management (37 of 2400 [1.5%] for human readers, 435 of 2400 [18.1%] for Bard, 344 of 2400 [14.3%] for GPT-3.5, and 255 of 2400 [10.6%] for GPT-4; P < .001). Conclusion: LLMs achieved moderate agreement with human reader-assigned BI-RADS categories across reports written in three languages but also yielded a high percentage of discordant BI-RADS categories that would negatively impact clinical management.
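The agreement measure used in the study, Gwet's first-order agreement coefficient (AC1), corrects observed agreement for chance using the average marginal proportion of each category. A minimal two-rater sketch is shown below; this is an illustration of the published AC1 formula, not the study's actual analysis code, and the function name and interface are hypothetical:

```python
def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 chance-corrected agreement for two raters on nominal
    categories (e.g., BI-RADS 1-5). Assumes at least two distinct categories."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    k = len(categories)
    # Observed agreement: fraction of items rated identically by both raters.
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over categories of pi*(1 - pi), normalized by (k - 1),
    # where pi is the category's average marginal proportion across both raters.
    pe = sum(
        (pi := (ratings_a.count(c) + ratings_b.count(c)) / (2 * n)) * (1 - pi)
        for c in categories
    ) / (k - 1)
    return (pa - pe) / (1 - pe)
```

With perfectly matching ratings the coefficient is 1.0; values around 0.5, as reported for the LLMs here, indicate moderate agreement under common benchmarking scales.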
Pages: 8