Assessing GPT-4 multimodal performance in radiological image analysis

Cited by: 6
Authors
Brin, Dana [1, 2]
Sorin, Vera [1, 2, 3]
Barash, Yiftach [1, 2, 3]
Konen, Eli [1, 2]
Glicksberg, Benjamin S. [4]
Nadkarni, Girish N. [5, 6]
Klang, Eyal [1, 2, 3, 5, 6]
Affiliations
[1] Chaim Sheba Med Ctr, Dept Diagnost Imaging, Tel Hashomer, Israel
[2] Tel Aviv Univ, Fac Med, Tel Aviv, Israel
[3] Chaim Sheba Med Ctr, DeepVis Lab, Tel Hashomer, Israel
[4] Icahn Sch Med Mt Sinai, Hasso Plattner Inst Digital Hlth, New York, NY USA
[5] Icahn Sch Med Mt Sinai, Div Data Driven & Digital Med D3M, New York, NY USA
[6] Icahn Sch Med Mt Sinai, Charles Bronfman Inst Personalized Med, New York, NY USA
Keywords
Artificial intelligence; Diagnostic imaging; Radiology; Ultrasonography; Computed tomography (x-ray);
DOI
10.1007/s00330-024-11035-5
Chinese Library Classification (CLC) number
R8 [Special medicine]; R445 [Diagnostic imaging]
Discipline classification code
1002; 100207; 1009
Abstract
Objectives: This study assesses the performance of GPT-4V, a multimodal artificial intelligence (AI) model capable of analyzing both images and text, in interpreting radiological images. It covers a range of modalities, anatomical regions, and pathologies to explore the potential of zero-shot generative AI to enhance diagnostic processes in radiology.
Methods: We used GPT-4V to analyze 230 anonymized emergency room diagnostic images, collected consecutively over 1 week. Modalities included ultrasound (US), computerized tomography (CT), and X-ray. GPT-4V's interpretations were compared with those of senior radiologists to evaluate its accuracy in recognizing the imaging modality, anatomical region, and pathology present in each image.
Results: GPT-4V identified the imaging modality correctly in 100% of cases (221/221), the anatomical region in 87.1% (189/217), and the pathology in 35.2% (76/216). Performance varied significantly across modalities: anatomical region identification accuracy was 60.9% (39/64) in US, 97.0% (98/101) in CT, and 100% (52/52) in X-ray images (p < 0.001). Similarly, pathology identification accuracy was 9.1% (6/66) in US, 36.4% (36/99) in CT, and 66.7% (34/51) in X-ray images (p < 0.001). These variations indicate that GPT-4V does not interpret radiological images consistently across modalities.
Conclusion: While the integration of AI in radiology, exemplified by multimodal GPT-4, offers promising avenues for diagnostic enhancement, GPT-4V is not yet reliable for interpreting radiological images. This study underscores the need for ongoing development to achieve dependable performance in radiology diagnostics.
Clinical relevance statement: Although GPT-4V shows promise in radiological image interpretation, its high diagnostic hallucination rate (> 40%) indicates that it cannot be trusted for clinical use as a standalone tool. Improvements are necessary to enhance its reliability and ensure patient safety. Key Points...
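The per-modality figures reported in the Results can be rechecked from the counts alone. The sketch below is illustrative only: it recomputes the percentages and runs a Pearson chi-square test of independence on each 3 x 2 table (modality by correct/incorrect). The abstract does not name the statistical test the authors used, so the choice of Pearson's chi-square is an assumption, and the function names are hypothetical.

```python
import math

# (correct, total) per modality, copied from the Results paragraph above.
REGION = {"US": (39, 64), "CT": (98, 101), "X-ray": (52, 52)}
PATHOLOGY = {"US": (6, 66), "CT": (36, 99), "X-ray": (34, 51)}


def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


def chi_square_3x2(counts):
    """Pearson chi-square for three groups by correct/incorrect.

    ASSUMPTION: the abstract does not state which test produced
    p < 0.001; this is one plausible choice, not the authors' method.
    """
    rows = [(c, t - c) for c, t in counts.values()]
    if len(rows) != 3:
        raise ValueError("closed-form p-value below needs exactly 3 groups")
    n = sum(sum(row) for row in rows)
    col_totals = [sum(row[j] for row in rows) for j in range(2)]
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / n
            chi2 += (row[j] - expected) ** 2 / expected
    # With 2 degrees of freedom the chi-square survival function
    # has the closed form exp(-x / 2), so no SciPy is needed.
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value


for name, table in (("region", REGION), ("pathology", PATHOLOGY)):
    for modality, (correct, total) in table.items():
        print(name, modality, accuracy_pct(correct, total))
    chi2, p = chi_square_3x2(table)
    print(name, "chi2 = %.2f, p < 0.001: %s" % (chi2, p < 0.001))
```

Under this assumed test, both comparisons fall below the p < 0.001 threshold quoted in the abstract, and the per-modality counts sum to the pooled totals (189/217 and 76/216).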
Pages: 1959-1965
Page count: 7