Classification ensembles for unbalanced class sizes in predictive toxicology

Cited by: 44
Authors
Chen, JJ [1]
Tsai, CA
Young, JF
Kodell, RL
Affiliations
[1] US FDA, Div Biometry & Risk Assessment, Natl Ctr Toxicol Res, Jefferson, AR 72079 USA
[2] Acad Sinica, Inst Stat Sci, Taipei 11529, Taiwan
Keywords
bagging; cross validation; ensemble classification; imbalanced data; sensitivity; specificity
DOI
10.1080/10659360500468468
Chinese Library Classification number
O6 [Chemistry]
Subject classification code
0703
Abstract
This paper investigates the effects of the ratio of positive-to-negative samples on the sensitivity, specificity, and concordance of a binary classifier. When the class sizes in the training samples are not equal, the derived classification rule favors the majority class and yields low sensitivity when predicting the minority class. We propose an ensemble classification approach to adjust for differential class sizes in a binary classifier system. An ensemble classifier consists of a set of base classifiers; its prediction rule is based on a summary measure of the individual classifications made by the base classifiers. Two re-sampling methods, augmentation and abatement, are proposed to generate different bootstrap samples of equal class size on which to build the base classifiers. The augmentation method balances the two class sizes by bootstrapping additional samples from the minority class, whereas the abatement method balances them by sampling only a subset of the majority class. The proposed procedure is applied to a data set to predict estrogen receptor binding activity and to a data set to predict animal liver carcinogenicity, using SAR (structure-activity relationship) models as base classifiers. The abatement method appears to perform well in balancing sensitivity and specificity.
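The two re-sampling schemes and the ensemble rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the nearest-centroid base classifier (standing in for the SAR models), the majority-vote summary measure, the number of base classifiers, and the choice to draw the abatement subset without replacement are all assumptions made here. Concordance is taken to mean the overall fraction of correct predictions.

```python
import numpy as np

def balanced_indices(y, method, rng):
    """Return row indices of one class-balanced resample (labels are 0/1)."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if method == "augmentation":
        # Keep every sample and bootstrap extra minority samples
        # until both classes have the majority-class size.
        extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
        return np.concatenate([majority, minority, extra])
    if method == "abatement":
        # Keep every minority sample and draw an equally sized majority subset
        # (without replacement: an assumption, the abstract does not say).
        subset = rng.choice(majority, size=len(minority), replace=False)
        return np.concatenate([minority, subset])
    raise ValueError(f"unknown method: {method}")

def fit_centroids(X, y):
    """Hypothetical base classifier: store the mean vector of each class."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def predict_centroids(model, X):
    """Assign each row to the class whose centroid is nearer."""
    c0, c1 = model
    d0 = np.linalg.norm(X - c0, axis=1)
    d1 = np.linalg.norm(X - c1, axis=1)
    return (d1 < d0).astype(int)

def ensemble_predict(X_train, y_train, X_test, method, n_base=25, seed=0):
    """Train n_base classifiers on balanced resamples and take a majority vote."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_test))
    for _ in range(n_base):
        idx = balanced_indices(y_train, method, rng)
        model = fit_centroids(X_train[idx], y_train[idx])
        votes += predict_centroids(model, X_test)
    # Summary measure (assumed): predict positive when more than half vote positive.
    return (votes / n_base > 0.5).astype(int)

def sens_spec_conc(y_true, y_pred):
    """Return (sensitivity, specificity, concordance) for 0/1 label arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    sensitivity = tp / np.sum(y_true == 1)
    specificity = tn / np.sum(y_true == 0)
    concordance = (tp + tn) / len(y_true)
    return sensitivity, specificity, concordance
```

Calling `ensemble_predict(X_train, y_train, X_test, "abatement")` and passing the result to `sens_spec_conc` gives the three measures for one train/test split; swapping in `"augmentation"` allows a side-by-side comparison of the two schemes on the same data.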
Pages: 517-529
Page count: 13