Ensemble Learning with Active Example Selection for Imbalanced Biomedical Data Classification

Cited by: 72
Authors
Oh, Sangyoon [1]
Lee, Min Su [2, 3]
Zhang, Byoung-Tak [2, 3]
Affiliations
[1] Ajou Univ, Div Informat & Comp Engn, WISE Lab, Suwon 443749, Kyeonggi, South Korea
[2] Seoul Natl Univ, CBIT, Seoul 151742, South Korea
[3] Seoul Natl Univ, Sch Engn & Comp Sci, Seoul 151742, South Korea
Keywords
Bioinformatics; classification; interactive data exploration and discovery; mining methods and algorithms; DISCOVERY;
DOI
10.1109/TCBB.2010.96
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Imbalanced data occur frequently in biomedical domains and lead to poor prediction performance on minority classes, because trained classifiers are derived mostly from the majority class. In this paper, we describe an ensemble learning method combined with active example selection to resolve the imbalanced data problem. Our method consists of three key components: 1) an active example selection algorithm that chooses informative examples for training the classifier, 2) an ensemble learning method that combines the classifier variants derived by active example selection, and 3) an incremental learning scheme that speeds up the iterative training procedure of active example selection. We evaluate the method on six real-world imbalanced data sets from biomedical domains and show that it outperforms both random undersampling and an ensemble with undersampling. Compared to other approaches to the imbalanced data problem, our method achieves an AUC that is higher by 0.03 to 0.15.
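The three components named in the abstract can be illustrated with a minimal sketch. The abstract does not state the base classifier, the selection criterion, or the way ensemble members are combined, so everything below is an assumption made for illustration: a nearest-centroid scorer stands in for the classifier, "informative" is taken to mean the majority examples that currently look most like the minority class, and members are combined by averaging scores. The names CentroidModel, train_member, build_ensemble, and ensemble_score are hypothetical and are not taken from the paper.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


class CentroidModel:
    """Assumed stand-in classifier: nearest centroid with running sums."""

    def __init__(self, minority, majority):
        dim = len(minority[0])
        n = len(minority)
        self.min_c = [sum(x[i] for x in minority) / n for i in range(dim)]
        self.maj_sum = [0.0] * dim
        self.maj_n = 0
        for x in majority:
            self.add_majority(x)

    def add_majority(self, x):
        # Incremental learning (assumed form): update running sums
        # instead of retraining from scratch after each selection round.
        self.maj_sum = [s + v for s, v in zip(self.maj_sum, x)]
        self.maj_n += 1

    def score(self, x):
        # Higher score means the example looks more like the minority class.
        maj_c = [s / self.maj_n for s in self.maj_sum]
        return dist2(x, maj_c) - dist2(x, self.min_c)


def train_member(minority, majority, rng, rounds=3, batch=5):
    """Train one ensemble member with active example selection."""
    pool = majority[:]
    rng.shuffle(pool)
    start = len(minority)  # begin with a balanced random undersample
    selected, pool = pool[:start], pool[start:]
    model = CentroidModel(minority, selected)
    for _ in range(rounds):
        if not pool:
            break
        # Assumed criterion: pick the unused majority examples that the
        # current model most strongly mistakes for the minority class.
        pool.sort(key=model.score, reverse=True)
        chosen, pool = pool[:batch], pool[batch:]
        for x in chosen:
            model.add_majority(x)
    return model


def build_ensemble(minority, majority, n_members=5, seed=0):
    """Members differ through their random initial undersample."""
    rng = random.Random(seed)
    return [train_member(minority, majority, rng) for _ in range(n_members)]


def ensemble_score(models, x):
    """Combine members by averaging their scores (assumed combination rule)."""
    return sum(m.score(x) for m in models) / len(models)
```

Calling build_ensemble(minority, majority) with two lists of feature vectors returns the member models, and ensemble_score(models, x) gives a score that is positive when x is closer to the minority class under this toy scorer. The running-sum update is what makes each selection round cheap here; a real learner would need its own incremental update rule.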
Pages: 316-325
Page count: 10
Related papers
28 in total
  • [1] [Anonymous], P SIAM INT C DAT MIN
  • [2] [Anonymous], 2007, ICML, DOI 10.1145/1273496.1273614
  • [3] [Anonymous], 2007, Uci machine learning repository
  • [4] [Anonymous], P 14 INT C GEN INF D
  • [5] Batista G. E., 2004, ACM SIGKDD Explor. Newslett., P20, DOI 10.1145/1007730.1007735
  • [6] An empirical comparison of voting classification algorithms: Bagging, boosting, and variants
    Bauer, E
    Kohavi, R
    [J]. MACHINE LEARNING, 1999, 36 (1-2) : 105 - 139
  • [7] Operations for Learning with Graphical Models
    Buntine, Wray L.
    [J]. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 1994, 2 : 159 - 225
  • [8] Glycosylation site prediction using ensembles of Support Vector Machine classifiers
    Caragea, Cornelia
    Sinapov, Jivko
    Silvescu, Adrian
    Dobbs, Drena
    Honavar, Vasant
    [J]. BMC BIOINFORMATICS, 2007, 8 (1)
  • [9] Cestnik Bojan, 1987, EWSL, P31
  • [10] Chawla N. V., 2004, ACM SIGKDD Explorations Newsletter, V6, P1