ROC operating point selection for classification of imbalanced data with application to computer-aided polyp detection in CT colonography

Cited by: 54
Authors
Song, Bowen [1 ,2 ]
Zhang, Guopeng [3 ]
Zhu, Wei [2 ]
Liang, Zhengrong [1 ]
Affiliations
[1] SUNY Stony Brook, Dept Radiol, Stony Brook, NY 11790 USA
[2] SUNY Stony Brook, Dept Appl Math & Stat, Stony Brook, NY 11790 USA
[3] Fourth Mil Med Univ, Dept Biomed Engn, Xian 710032, Shaanxi, Peoples R China
Keywords
Computer-aided detection and diagnosis (CAD); Computed tomography colonography (CTC); Random forests; Harmonic mean; Support vector machine (SVM); Receiver operating characteristic (ROC); TOMOGRAPHIC VIRTUAL COLONOSCOPY; SUPPORT VECTOR MACHINES; FEATURES;
DOI
10.1007/s11548-013-0913-8
Chinese Library Classification (CLC) number
R318 [Biomedical engineering];
Discipline classification code
0831;
Abstract
Computer-aided detection and diagnosis (CAD) of colonic polyps routinely faces the challenge of classifying imbalanced data. In this paper, three new operating point selection strategies based on the receiver operating characteristic (ROC) curve are proposed to address the problem. A major reason that classifiers perform poorly on imbalanced data is that the best decision threshold shifts with the degree of imbalance. To address this threshold shift, three operating point selection strategies, namely shortest distance, harmonic mean and anti-harmonic mean, are proposed and their performance is investigated. Experiments were conducted on a class-imbalanced database containing 64 polyps among 786 polyp candidates. Support vector machine (SVM) and random forests (RFs) were employed as base classifiers. Two techniques for correcting data imbalance, cost-sensitive learning and down-sampling of the training data, were applied to SVM and RFs, and their performance was compared with that of the proposed strategies. Compared with the original thresholding method, which gave 0.488 sensitivity and 0.986 specificity for RFs and 0.526 sensitivity and 0.977 specificity for SVM, the proposed strategies achieved more balanced results: around 0.89 sensitivity and 0.92 specificity for RFs and 0.88 sensitivity and 0.90 specificity for SVM. Their performance remained at the same level regardless of whether the other correcting methods were used. Based on these experiments, the gain from the proposed strategies is noticeable: the sensitivity improved from about 0.5 to around 0.88 for RFs and 0.89 for SVM while specificity remained relatively high, i.e., 0.92 for RFs and 0.90 for SVM. The performance of the proposed strategies was adaptive and robust across different levels of data imbalance. This indicates a feasible solution to the threshold shifting problem, giving favorable sensitivity and specificity in CAD of polyps from imbalanced data.
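The abstract names the selection strategies but does not give their formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that "shortest distance" means the ROC point closest (in Euclidean distance) to the ideal corner where sensitivity and specificity both equal 1, and that "harmonic mean" means the harmonic mean of sensitivity and specificity. The anti-harmonic mean is left out because its definition is not stated here. The function names `roc_points` and `select_operating_point` are hypothetical.

```python
import numpy as np


def roc_points(labels, scores):
    """Sensitivity and specificity at every distinct score used as a cut-off.

    A candidate is called positive when its score is >= the threshold.
    Assumes both classes are present in `labels`.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.unique(scores)  # sorted, distinct cut-offs
    # Row i holds the predictions obtained with thresholds[i].
    pred = scores[None, :] >= thresholds[:, None]
    true_pos = (pred & labels).sum(axis=1)
    true_neg = (~pred & ~labels).sum(axis=1)
    sensitivity = true_pos / labels.sum()
    specificity = true_neg / (~labels).sum()
    return thresholds, sensitivity, specificity


def select_operating_point(labels, scores, strategy="shortest_distance"):
    """Pick one threshold on the ROC curve (assumed definitions, see lead-in)."""
    thresholds, sens, spec = roc_points(labels, scores)
    if strategy == "shortest_distance":
        # Assumption: minimize the distance to the ideal point (sens=1, spec=1).
        criterion = -np.hypot(1.0 - sens, 1.0 - spec)
    elif strategy == "harmonic_mean":
        # Assumption: maximize the harmonic mean of sensitivity and specificity.
        criterion = 2.0 * sens * spec / np.maximum(sens + spec, 1e-12)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    best = int(np.argmax(criterion))
    return thresholds[best], sens[best], spec[best]


if __name__ == "__main__":
    # Toy data: 4 negatives and 2 positives with made-up classifier scores.
    toy_labels = [0, 0, 0, 0, 1, 1]
    toy_scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]
    for name in ("shortest_distance", "harmonic_mean"):
        print(name, select_operating_point(toy_labels, toy_scores, name))
```

Both criteria reward thresholds that keep sensitivity and specificity high together, which is why they tend to move the cut-off away from a default threshold that favors the majority class. In practice the threshold would be chosen on training or validation scores and then applied unchanged to test data.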
Pages: 79-89
Page count: 11