Hurdles to Artificial Intelligence Deployment: Noise in Schemas and "Gold" Labels

Cited by: 13
Authors
Abdalla, Mohamed [1 ,2 ]
Fine, Benjamin [1 ,3 ]
Affiliations
[1] Trillium Hlth Partners, Inst Better Hlth, Mississauga, ON, Canada
[2] Univ Toronto, Ctr Informat Technol, Dept Comp Sci, 40 St George St, Room 4283, Toronto, ON M5S 2E4, Canada
[3] Univ Toronto, Dept Med Imaging, 40 St George St, Room 4283, Toronto, ON M5S 2E4, Canada
Keywords
Chest radiographs; Variability; Diagnosis
DOI
10.1148/ryai.220056
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Despite frequent reports of imaging artificial intelligence (AI) that parallels human performance, clinicians often question the safety and robustness of AI products in practice. This work explores two underreported sources of noise that negatively affect imaging AI: (a) variation in labeling schema definitions and (b) noise in the labeling process. First, the schemas of two publicly available datasets and a third-party vendor are compared, showing low agreement (<50%) between them. The authors also highlight the problem of label inconsistency, where different annotation schemas are selected for the same clinical prediction task; this results in inconsistent use of medical ontologies through intermingled or duplicated observations and diseases. Second, the individual radiologist annotations for the CheXpert test set are used to quantify noise in the labeling process. The analysis demonstrated that label noise varies by class: agreement was high for pneumothorax and medical devices (percent agreement > 90%), whereas for low-agreement classes (pneumonia, consolidation) the labels assigned as "ground truth" were unreliable, suggesting that the result of majority voting depends heavily on which group of radiologists is assigned to annotate. Noise in labeling schemas and gold-label annotations is pervasive in medical imaging classification and affects downstream clinical deployment. Possible solutions (eg, changes to task design, annotation methods, and model training) and their potential to improve trust in clinical AI are discussed.
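The agreement analysis described in the abstract can be sketched concretely. The following Python example uses hypothetical binary annotations (the data, function names, and panel compositions are illustrative assumptions, not the authors' code or the actual CheXpert annotations) to compute pairwise percent agreement for one class and to show how a majority-voted "gold" label can flip depending on which panel of radiologists votes.

    import itertools
    import numpy as np

    # Hypothetical binary annotations: rows = radiologists, columns = cases.
    # 1 = finding present, 0 = absent.
    labels = np.array([
        [1, 0, 1, 1, 0, 1],   # radiologist A
        [1, 0, 0, 1, 0, 1],   # radiologist B
        [0, 0, 1, 1, 1, 1],   # radiologist C
        [1, 0, 0, 1, 1, 0],   # radiologist D
        [0, 0, 1, 1, 0, 1],   # radiologist E
    ])

    def percent_agreement(ann: np.ndarray) -> float:
        """Mean pairwise agreement across all annotator pairs and cases."""
        pairs = itertools.combinations(range(ann.shape[0]), 2)
        per_pair = [np.mean(ann[i] == ann[j]) for i, j in pairs]
        return float(np.mean(per_pair))

    def majority_vote(ann: np.ndarray) -> np.ndarray:
        """Per-case majority label (ties broken toward positive)."""
        return (ann.mean(axis=0) >= 0.5).astype(int)

    print(f"percent agreement: {percent_agreement(labels):.2f}")

    # "Ground truth" from two different three-radiologist panels: for a
    # low-agreement class, the voted label depends on who is voting.
    panel_1 = majority_vote(labels[[0, 1, 3]])   # A, B, D
    panel_2 = majority_vote(labels[[2, 4, 0]])   # C, E, A
    print("panel 1 gold labels:", panel_1)
    print("panel 2 gold labels:", panel_2)
    print("cases where the gold label flips:", np.where(panel_1 != panel_2)[0])

With these made-up annotations, two cases receive opposite gold labels depending on the panel. For a high-agreement class such as pneumothorax, different panels would rarely disagree; for low-agreement classes such as pneumonia or consolidation, the flipped cases illustrate why a majority-voted ground truth can be unstable.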
Pages: 8