Multiple-rule bias in the comparison of classification rules

Cited by: 11
Authors
Yousefi, Mohammadmahdi R. [1]
Hua, Jianping [2]
Dougherty, Edward R. [1, 2, 3]
Affiliations
[1] Texas A&M Univ, Dept Elect & Comp Engn, College Stn, TX 77843 USA
[2] Translat Genom Res Inst, Computat Biol Div, Phoenix, AZ 85004 USA
[3] Univ Texas MD Anderson Canc Ctr, Dept Bioinformat & Computat Biol, Houston, TX 77030 USA
Funding
National Institutes of Health (USA); National Science Foundation (USA)
Keywords
MICROARRAY DATA; DISCRIMINANT-ANALYSIS; OVER-OPTIMISM; ERROR RATE; PREDICTION; VALIDATION; SELECTION; BIOINFORMATICS; PERFORMANCE; GENOMICS;
DOI
10.1093/bioinformatics/btr262
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: There is growing discussion in the bioinformatics community concerning overoptimism of reported results. Two approaches contributing to overoptimism in classification are (i) the reporting of results on datasets for which a proposed classification rule performs well and (ii) the comparison of multiple classification rules on a single dataset that purports to show the advantage of a certain rule.

Results: This article provides a careful probabilistic analysis of the second issue and the 'multiple-rule bias', resulting from choosing a classification rule having minimum estimated error on the dataset. It quantifies this bias corresponding to estimating the expected true error of the classification rule possessing minimum estimated error and it characterizes the bias from estimating the true comparative advantage of the chosen classification rule relative to the others by the estimated comparative advantage on the dataset. The analysis is applied to both synthetic and real data using a number of classification rules and error estimators.
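
The selection effect described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' analysis. It is a deliberately simplified toy model that assumes every rule has the same true error, that each rule's error estimate is an independent hold-out estimate (a binomial proportion), and that the reported figure is the minimum estimate. The function name `multiple_rule_bias` and all parameter values are hypothetical choices made for illustration.

```python
import random
import statistics


def multiple_rule_bias(num_rules, sample_size, true_error, trials, seed=0):
    """Estimate E[min estimated error] - true error under a toy model.

    Assumptions (illustrative only, not taken from the paper):
    - all rules share the same true error `true_error`;
    - each estimate is an independent binomial proportion on
      `sample_size` held-out points.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        # One error estimate per rule: fraction of misclassified points.
        estimates = [
            sum(rng.random() < true_error for _ in range(sample_size)) / sample_size
            for _ in range(num_rules)
        ]
        # Reporting the best-looking rule means reporting the minimum estimate.
        gaps.append(min(estimates) - true_error)
    return statistics.mean(gaps)


if __name__ == "__main__":
    for r in (1, 2, 5, 10):
        bias = multiple_rule_bias(r, sample_size=50, true_error=0.3, trials=2000)
        print(f"rules={r:2d}  mean(min estimate - true error)={bias:+.4f}")
```

With a single rule the average gap stays close to zero, because each estimate is unbiased in this model. As more rules are compared, the minimum estimate falls further below the true error, which is the optimistic direction the abstract refers to.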
Pages: 1675-1683
Page count: 9