An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data

Cited by: 54
Authors
Hao, Ming [1]
Wang, Yanli [1]
Bryant, Stephen H. [1]
Affiliations
[1] NIH, Natl Ctr Biotechnol Informat, Natl Lib Med, Bethesda, MD 20894 USA
Funding
National Institutes of Health (USA)
Keywords
High throughput screening; Under-sampling; Over-sampling; PubChem; Imbalanced classification; THROUGHPUT SCREENING DATA; RANDOM FOREST; LARGE SET; CLASSIFICATION; PREDICTION; MODEL; REGRESSION; SELECTION; COMPOUND; SMOTE;
DOI
10.1016/j.aca.2013.10.050
Chinese Library Classification
O65 [Analytical Chemistry]
Discipline classification codes
070302; 081704
Abstract
High-throughput screening (HTS) commonly generates imbalanced datasets. When the imbalance is not taken into account, most classification methods achieve high predictive accuracy for the majority class but markedly poor performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE), is developed and applied to overcome this problem for several imbalanced datasets from PubChem BioAssay. With the proposed combined method, the rare samples (active compounds), for which results are usually poor, can be detected with high balanced accuracy (Gmean). For comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also used to classify the same datasets. Our results show that the former (GLMBoost + SMOTE) not only exhibits higher performance, as measured by the percentage of correctly classified rare samples (Sensitivity) and Gmean, but also demonstrates greater computational efficiency than the latter (RF + SMOTE). We therefore hope that the proposed combined algorithm based on GLMBoost and SMOTE can be widely used to tackle the imbalanced classification problem. Published by Elsevier B.V.
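The two ideas the abstract relies on, SMOTE-style over-sampling of the rare class and scoring by Sensitivity and Gmean, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the common definition of Gmean as the geometric mean of sensitivity and specificity, and the function names (`sensitivity`, `specificity`, `gmean`, `smote_sketch`) and the parameters `k` and `seed` are hypothetical choices made for illustration only. GLMBoost itself is not reproduced here.

```python
import math
import random


def sensitivity(y_true, y_pred, positive=1):
    """Fraction of positive (rare, active) samples predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(y_true, y_pred, positive=1):
    """Fraction of negative (majority, inactive) samples predicted as negative."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    return tn / (tn + fp) if tn + fp else 0.0


def gmean(y_true, y_pred, positive=1):
    """Assumed definition: sqrt(sensitivity * specificity)."""
    return math.sqrt(
        sensitivity(y_true, y_pred, positive) * specificity(y_true, y_pred, positive)
    )


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority points.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic
```

A classifier that labels every compound inactive scores well on plain accuracy for an imbalanced assay yet has a Gmean of zero, since its sensitivity is zero. That is why a metric of this kind suits the rare-active setting described in the abstract.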
Pages: 117-127
Page count: 11