Multi-Institutional Assessment and Crowdsourcing Evaluation of Deep Learning for Automated Classification of Breast Density

Times cited: 41
Authors
Chang, Ken [1]
Beers, Andrew L. [1]
Brink, Laura [2]
Patel, Jay B. [1]
Singh, Praveer [1]
Arun, Nishanth T. [1]
Hoebel, Katharina V. [1]
Gaw, Nathan [1]
Shah, Meesam [2]
Pisano, Etta D. [3, 4]
Tilkin, Mike [5]
Coombs, Laura P. [3]
Dreyer, Keith J. [6, 7, 8, 9, 10]
Allen, Bibb [10, 11, 12]
Agarwal, Sheela [13]
Kalpathy-Cramer, Jayashree [1, 14, 15, 16, 17]
Affiliations
[1] Massachusetts Gen Hosp, Dept Radiol, Athinoula A Martinos Ctr Biomed Imaging, Boston, MA USA
[2] Amer Coll Radiol, Reston, VA USA
[3] ACR, Reston, VA USA
[4] Beth Israel Lahey Harvard Med Sch, Residence, Boston, MA USA
[5] ACR, Technol, Reston, VA USA
[6] MGH & BWH, Boston, MA USA
[7] MGH & BWH, Ctr Clin Data Sci, Boston, MA USA
[8] MGH & BWH, Radiol Informat, Boston, MA USA
[9] Harvard Med Sch, Radiol, Boston, MA 02115 USA
[10] ACR Data Sci Inst, Reston, VA USA
[11] Int Soc Radiol, Reston, VA USA
[12] Grandview Med Ctr, Birmingham, AL USA
[13] Lennox Hill Radiol, New York, NY USA
[14] Harvard Med Sch, CCDS, Boston, MA 02115 USA
[15] Harvard Med Sch, QTIM Lab, Boston, MA 02115 USA
[16] Harvard Med Sch, Ctr Machine Learning, Boston, MA 02115 USA
[17] Harvard Med Sch, Radiol, MGH, Boston, MA 02115 USA
Funding
US National Institutes of Health (NIH);
Keywords
ACR AI-LAB; artificial intelligence; BI-RADS; breast density; deep learning; DMIST; generalizability; mammogram; neural networks; CANCER; MAMMOGRAPHY; RADIOLOGISTS; PERFORMANCE; IMPACT; RISK;
DOI
10.1016/j.jacr.2020.05.015
Chinese Library Classification (CLC) number
R8 [Special Medicine]; R445 [Diagnostic Imaging];
Discipline classification code
1002; 100207; 1009;
Abstract
Objective: We developed deep learning algorithms to automatically assess BI-RADS breast density.
Methods: Using a large multi-institution patient cohort of 108,230 digital screening mammograms from the Digital Mammographic Imaging Screening Trial, we investigated the effect of data, model, and training parameters on overall model performance and provided crowdsourcing evaluation from the attendees of the ACR 2019 Annual Meeting.
Results: Our best-performing algorithm achieved good agreement with radiologists who were qualified interpreters of mammograms, with a four-class kappa of 0.667. When training was performed with randomly sampled images from the data set versus sampling an equal number of images from each density category, the model predictions were biased away from the low-prevalence categories such as extremely dense breasts. The net result was an increase in sensitivity and a decrease in specificity for predicting dense breasts for equal-class sampling compared with random sampling. We also found that the performance of the model degrades when it is evaluated on digital mammography data formats that differ from the one it was trained on, emphasizing the importance of multi-institutional training sets. Lastly, we showed that crowdsourced annotations, including those from attendees who routinely read mammograms, had higher agreement with our algorithm than with the original interpreting radiologists.
Conclusion: We demonstrated the possible parameters that can influence the performance of the model and how crowdsourcing can be used for evaluation. This study was performed in tandem with the development of the ACR AI-LAB, a platform for democratizing artificial intelligence.
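Two quantities in the abstract can be illustrated with a short sketch: the agreement statistic (kappa) and the equal-per-class sampling scheme. The code below is a minimal illustration, not the authors' implementation. It assumes an unweighted Cohen's kappa (the abstract does not say whether the reported four-class kappa is weighted) and sampling with replacement within each class; the function names cohen_kappa and sample_equal_per_class are hypothetical.

```python
import random
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two equal-length label sequences.

    Assumption: unweighted kappa. The source only reports a
    "four-class kappa" without naming the variant.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of cases where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal class frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical class throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def sample_equal_per_class(labels, per_class, rng):
    """Return indices with `per_class` draws from every class.

    Assumption: draws are made with replacement, so rare classes
    (for example, extremely dense breasts) are oversampled.
    """
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    picked = []
    for label in sorted(by_class):
        picked.extend(rng.choices(by_class[label], k=per_class))
    return picked


if __name__ == "__main__":
    # Toy, made-up labels for the four BI-RADS density categories a-d.
    radiologist = ["a", "b", "b", "c", "c", "c", "d", "b"]
    model = ["a", "b", "c", "c", "c", "b", "d", "b"]
    print("kappa:", round(cohen_kappa(radiologist, model), 3))

    imbalanced = ["a"] * 10 + ["b"] * 40 + ["c"] * 45 + ["d"] * 5
    batch = sample_equal_per_class(imbalanced, per_class=4, rng=random.Random(0))
    print(Counter(imbalanced[i] for i in batch))
```

Under random sampling, a batch mirrors the class prevalence of the data set, so low-prevalence categories are seen rarely. Equal-per-class sampling presents each category equally often, which is consistent with the shift toward higher sensitivity and lower specificity for dense breasts that the abstract reports.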
Pages: 1653-1662
Page count: 10