Assessing the Trustworthiness of Saliency Maps for Localizing Abnormalities in Medical Imaging

Cited by: 148
Authors
Arun, Nishanth [1, 2]
Gaw, Nathan [3]
Singh, Praveer [1]
Chang, Ken [1, 4]
Aggarwal, Mehak [1]
Chen, Bryan [1, 4]
Hoebel, Katharina [1, 4]
Gupta, Sharut [1]
Patel, Jay [1, 4]
Gidwani, Mishka [1]
Adebayo, Julius [4]
Li, Matthew D. [1]
Kalpathy-Cramer, Jayashree [1]
Affiliations
[1] Harvard Med Sch, Massachusetts Gen Hosp, Athinoula A Martinos Ctr Biomed Imaging, Dept Radiol, 149 13th St, Boston, MA 02129 USA
[2] Shiv Nadar Univ, Dept Comp Sci, Greater Noida, India
[3] Air Force Inst Technol, Grad Sch Engn & Management, Dept Operat Sci, Dayton, OH USA
[4] MIT, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Funding
National Institutes of Health (United States)
Keywords
Technology Assessment; Technical Aspects; Feature Detection; Convolutional Neural Network (CNN)
DOI
10.1148/ryai.2021200267
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Purpose: To evaluate the trustworthiness of saliency maps for abnormality localization in medical imaging.
Materials and Methods: Using two large publicly available radiology datasets (Society for Imaging Informatics in Medicine-American College of Radiology Pneumothorax Segmentation dataset and Radiological Society of North America Pneumonia Detection Challenge dataset), the performance of eight commonly used saliency map techniques was quantified with regard to (a) localization utility (segmentation and detection), (b) sensitivity to model weight randomization, (c) repeatability, and (d) reproducibility. Their performance was compared with that of baseline methods and localization network architectures, using area under the precision-recall curve (AUPRC) and structural similarity index measure (SSIM) as metrics.
Results: All eight saliency map techniques failed at least one of the criteria and were inferior in performance compared with localization networks. For pneumothorax segmentation, the AUPRC ranged from 0.024 to 0.224, while a U-Net achieved a significantly superior AUPRC of 0.404 (P < .005). For pneumonia detection, the AUPRC ranged from 0.160 to 0.519, while a RetinaNet achieved a significantly superior AUPRC of 0.596 (P < .005). Five and two saliency methods (of eight) failed the model randomization test on the segmentation and detection datasets, respectively, suggesting that these methods are not sensitive to changes in model parameters. The repeatability and reproducibility of the majority of the saliency methods were worse than those of localization networks for both the segmentation and detection datasets.
Conclusion: The use of saliency maps in the high-risk domain of medical imaging warrants additional scrutiny, and the authors recommend that detection or segmentation models be used if localization is the desired output of the network. Supplemental material is available for this article. © RSNA, 2021.
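The abstract names its two metrics, AUPRC and SSIM, without spelling them out. The following is a minimal illustrative sketch of both, not the authors' code. The function names `average_precision` and `global_ssim` are hypothetical, images are assumed to be flattened into lists of per-pixel values, and the SSIM here is computed over the whole image as a single window, which is a simplification of the usual locally windowed SSIM.

```python
from statistics import mean


def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve.

    scores: flat list of saliency values, one per pixel.
    labels: flat list of 0/1 ground-truth values, same length.
    Ties in scores are broken by input order (a simplification).
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("ground truth contains no positive pixels")
    # Rank pixels from most to least salient.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            ap += hits / rank  # precision at each newly recalled positive
    return ap / total_pos


def global_ssim(x, y, data_range=1.0):
    """SSIM over the whole image as one window (assumed simplification)."""
    c1 = (0.01 * data_range) ** 2  # stabilizing constants from the SSIM definition
    c2 = (0.03 * data_range) ** 2
    mx, my = mean(x), mean(y)
    n = len(x)
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


# Toy 4-pixel example (made-up values, not data from the study).
saliency = [0.9, 0.8, 0.1, 0.2]
mask = [1, 0, 1, 0]
print(average_precision(saliency, mask))  # 0.75

# A map compared with itself scores about 1; an inverted map scores below 0.
inverted = [0.1, 0.2, 0.9, 0.8]
print(global_ssim(saliency, saliency))
print(global_ssim(saliency, inverted))
```

In a model randomization test of the kind the abstract describes, a high SSIM between the maps produced before and after randomizing the weights would indicate that the saliency method is insensitive to the model parameters.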
Pages: 12