External Validation of an Ensemble Model for Automated Mammography Interpretation by Artificial Intelligence

Cited by: 30
Authors
Hsu, William [1 ]
Hippe, Daniel S. [2 ]
Nakhaei, Noor [1 ]
Wang, Pin-Chieh [3 ]
Zhu, Bing [1 ]
Siu, Nathan [4 ]
Ahsen, Mehmet Eren [5 ]
Lotter, William [6 ]
Sorensen, A. Gregory [6 ]
Naeim, Arash [7 ]
Buist, Diana S. M. [8 ]
Schaffter, Thomas [9 ]
Guinney, Justin [10 ]
Elmore, Joann G. [3 ]
Lee, Christoph I. [11,12,13]
Affiliations
[1] Univ Calif Los Angeles, Dept Radiol Sci, Med & Imaging Informat, David Geffen Sch Med, 924 Westwood Blvd,Ste 420, Los Angeles, CA 90024 USA
[2] Fred Hutchinson Canc Ctr, Clin Res Div, Seattle, WA USA
[3] Univ Calif Los Angeles, Dept Med, David Geffen Sch Med, Los Angeles, CA 90024 USA
[4] Univ Calif Los Angeles, Grad Programs Biosci, Med Informat Home Area, David Geffen Sch Med, Los Angeles, CA 90024 USA
[5] Univ Illinois, Gies Coll Business, Urbana, IL USA
[6] RadNet AI Solut, DeepHlth, Cambridge, MA USA
[7] Univ Calif Los Angeles, Ctr Systemat Measurable Actionable Resilient & Te, Clin & Translat Sci Inst, David Geffen Sch Med, Los Angeles, CA 90024 USA
[8] Kaiser Permanente Washington Hlth Res Inst, Seattle, WA USA
[9] Sage Bionetworks, Computat Oncol, Seattle, WA USA
[10] Tempus Labs, Chicago, IL USA
[11] Univ Washington, Sch Med, Dept Radiol, Seattle, WA 98195 USA
[12] Univ Washington, Sch Publ Hlth, Dept Hlth Serv, Seattle, WA 98195 USA
[13] Fred Hutchinson Canc Ctr, Hutchinson Inst Canc Outcomes Res, Seattle, WA USA
Funding
US National Institutes of Health; US National Science Foundation;
Keywords
BREAST-CANCER;
DOI
10.1001/jamanetworkopen.2022.42343
Chinese Library Classification
R5 [Internal Medicine];
Subject Classification Code
1002; 100201;
Abstract
IMPORTANCE With a shortfall in fellowship-trained breast radiologists, mammography screening programs are looking toward artificial intelligence (AI) to increase efficiency and diagnostic accuracy. External validation studies provide an initial assessment of how promising AI algorithms perform in different practice settings.

OBJECTIVE To externally validate an ensemble deep-learning model using data from a high-volume, distributed screening program of an academic health system with a diverse patient population.

DESIGN, SETTING, AND PARTICIPANTS In this diagnostic study, an ensemble learning method, which reweights the outputs of the 11 highest-performing individual AI models from the Digital Mammography Dialogue on Reverse Engineering Assessment and Methods (DREAM) Mammography Challenge, was used to predict the cancer status of an individual from a standard set of screening mammography images. The study used retrospective patient data collected between 2010 and 2020 from women aged 40 years and older who underwent a routine breast screening examination and participated in the Athena Breast Health Network at the University of California, Los Angeles (UCLA).

MAIN OUTCOMES AND MEASURES Performance of the challenge ensemble method (CEM) and of the CEM combined with radiologist assessment (CEM+R) was compared against ductal carcinoma in situ and invasive cancers diagnosed within a year of the screening examination, using performance metrics such as sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC).

RESULTS Evaluated on 37 317 examinations from 26 817 women (mean [SD] age, 58.4 [11.5] years), individual model AUROC estimates ranged from 0.77 (95% CI, 0.75-0.79) to 0.83 (95% CI, 0.81-0.85). The CEM model achieved an AUROC of 0.85 (95% CI, 0.84-0.87) in the UCLA cohort, lower than the performance achieved in the Kaiser Permanente Washington (AUROC, 0.90) and Karolinska Institute (AUROC, 0.92) cohorts.
The CEM+R model achieved sensitivity (0.813 [95% CI, 0.781-0.843] vs 0.826 [95% CI, 0.795-0.856]; P = .20) and specificity (0.925 [95% CI, 0.916-0.934] vs 0.930 [95% CI, 0.929-0.932]; P = .18) similar to radiologist performance. The CEM+R model had significantly lower sensitivity (0.596 [95% CI, 0.466-0.717] vs 0.850 [95% CI, 0.766-0.923]; P < .001) and specificity (0.803 [95% CI, 0.734-0.861] vs 0.945 [95% CI, 0.936-0.954]; P < .001) than radiologists in women with a prior history of breast cancer, and significantly lower specificity in Hispanic women (0.894 [95% CI, 0.873-0.910] vs 0.926 [95% CI, 0.919-0.933]; P = .004).

CONCLUSIONS AND RELEVANCE This study found that the high performance of an ensemble deep-learning model for automated screening mammography interpretation did not generalize to a more diverse screening cohort, suggesting that the model experienced underspecification. These findings suggest the need for model transparency and for fine-tuning of AI models to specific target populations prior to clinical adoption.
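The abstract describes the CEM as a weighted combination of per-examination cancer scores from the 11 challenge models, evaluated by AUROC. As a rough illustration of that idea only, not the study's actual method (the real CEM weights, scores, and reweighting procedure are not given here, and all names and numbers below are hypothetical), a weighted ensemble plus a rank-based AUROC can be sketched as:

```python
def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney statistic / (n_pos * n_neg)).

    labels: list of 0/1 cancer outcomes; scores: predicted probabilities.
    Ties in scores are ignored for brevity.
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sum of 1-indexed ranks of the positive examinations.
    rank_sum = sum(rank for rank, (_, y) in enumerate(pairs, start=1) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def ensemble(score_matrix, weights):
    """Weighted average of per-model scores, one row per examination."""
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, row)) / total
            for row in score_matrix]


# Toy example: two examinations scored by two models, second model
# weighted 3x; combined scores are then ranked against outcomes.
combined = ensemble([[0.2, 0.4], [0.6, 0.8]], [1.0, 3.0])
print(combined)                              # weighted per-exam scores
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In the study itself the weights were learned on challenge data and the comparison against radiologist assessment (CEM+R) added a second stage; this sketch only shows the score-combination and AUROC-scoring mechanics.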
Pages: 12