Automatic coronavirus disease 2019 diagnosis based on chest radiography and deep learning - Success story or dataset bias?

Cited by: 10
Authors
Dhont, Jennifer [1 ]
Wolfs, Cecile [1 ]
Verhaegen, Frank [1 ]
Affiliations
[1] Maastricht Univ, GROW Sch Oncol, Dept Radiat Oncol Maastro, Med Ctr, Dr Tanslaan 12, NL-6229 ET Maastricht, Netherlands
Keywords
artificial intelligence; COVID-19; dataset bias; X-ray imaging; classifier; radiology
DOI
10.1002/mp.15419
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Subject Classification Codes
1002; 100207; 1009
Abstract
Purpose: Over the last 2 years, the artificial intelligence (AI) community has presented several automatic screening tools for coronavirus disease 2019 (COVID-19) based on chest radiography (CXR), with reported accuracies often well over 90%. However, it has been noted that many of these studies likely suffered from dataset bias, leading to overly optimistic results. The purpose of this study was to thoroughly investigate to what extent biases have influenced the performance of a range of previously proposed and promising convolutional neural networks (CNNs), and to determine what performance can be expected with current CNNs on a realistic, unbiased dataset.
Methods: Five CNNs for COVID-19 positive/negative classification were implemented for evaluation: VGG19, ResNet50, InceptionV3, DenseNet201, and COVID-Net. To perform both internal and cross-dataset evaluations, four datasets were created. The first dataset, drawn from the Valencian Region Medical Image Bank (BIMCV), followed strict reverse transcriptase-polymerase chain reaction (RT-PCR) test criteria and came from a single reliable open-access databank, while the second dataset (COVIDxB8) was assembled from six online CXR repositories. The third and fourth datasets were created by combining the opposing classes of the BIMCV and COVIDxB8 datasets. To decrease inter-dataset variability, a pre-processing workflow of resizing, normalization, and histogram equalization was applied to all datasets. Classification performance was evaluated on unseen test sets using precision and recall. As a qualitative sanity check, saliency maps displaying the top 5%, 10%, and 20% most salient segments of the input CXRs were inspected to assess whether the CNNs relied on relevant information for decision making.
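The pre-processing workflow described above (resizing, normalization, and histogram equalization) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the target size, the nearest-neighbor resize, and the [0, 1] normalization range are all assumptions.

```python
import numpy as np

def preprocess_cxr(image, size=(224, 224)):
    """Resize, histogram-equalize, and normalize a grayscale CXR (uint8 array).

    Hypothetical sketch of the paper's pre-processing workflow; parameter
    choices (224x224 target, nearest-neighbor resize) are assumptions.
    """
    h, w = image.shape
    # Resize via nearest-neighbor index mapping (a stand-in for a library call).
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    img = image[rows[:, None], cols]
    # Histogram equalization: remap intensities through the normalized CDF.
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1)
    img = (cdf[img] * 255).astype(np.uint8)
    # Normalize to [0, 1] floats for the CNN input.
    return img.astype(np.float32) / 255.0
```

Applying one consistent pipeline to every repository, as the study does, removes the most obvious intensity-distribution differences between sources, which is what makes the remaining bias finding notable.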
In an additional experiment, to further investigate the origin of the potential dataset bias, all pixel values outside the lungs were set to zero through automatic lung segmentation before training and testing.
Results: When trained and evaluated on the single-source dataset (BIMCV), the performance of all CNNs is relatively low (precision: 0.65-0.72, recall: 0.59-0.71) but remains relatively consistent under external evaluation (precision: 0.58-0.82, recall: 0.57-0.72). In contrast, when trained and internally evaluated on the combinatory datasets, all CNNs performed well across all metrics (precision: 0.94-1.00, recall: 0.77-1.00). However, when subsequently evaluated cross-dataset, results dropped substantially (precision: 0.10-0.61, recall: 0.04-0.80). For all datasets, saliency maps revealed that the CNNs rarely focus on areas inside the lungs for their decision making. Moreover, even when all pixel values outside the lungs are set to zero, classification performance does not change and the dataset bias remains.
Conclusions: The results of this study confirm that, when trained on a combinatory dataset, CNNs tend to learn the origin of the CXRs rather than the presence or absence of disease, a behavior known as shortcut learning. The bias is shown to originate from differences in overall pixel values rather than from embedded text or symbols, despite consistent image pre-processing. When trained on a reliable and realistic single-source dataset in which non-lung pixels have been masked, current CNNs show limited sensitivity (<70%) for COVID-19 infection in CXR, calling into question their use as a reliable automatic screening tool.
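The lung-masking experiment and the precision/recall evaluation can be illustrated with a minimal sketch. The lung mask itself is assumed to come from a separate automatic segmentation model, which is not shown here; the function names are hypothetical.

```python
import numpy as np

def mask_outside_lungs(image, lung_mask):
    """Zero all pixels outside the lung fields.

    `lung_mask` is a boolean array produced by an automatic lung-segmentation
    model (assumed, not implemented here). Zeroing non-lung pixels tests
    whether a classifier is using information outside the lungs.
    """
    return np.where(lung_mask, image, 0)

def precision_recall(y_true, y_pred):
    """Precision and recall for binary COVID-19 positive/negative labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

If performance is unchanged after masking, as the study reports, the shortcut signal cannot lie solely in non-lung regions such as embedded text or markers, pointing instead to source-dependent differences in overall pixel statistics.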
Pages: 978-987
Page count: 10