An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data

Cited by: 54
Authors
Hao, Ming [1]
Wang, Yanli [1]
Bryant, Stephen H. [1]
Affiliations
[1] NIH, Natl Ctr Biotechnol Informat, Natl Lib Med, Bethesda, MD 20894 USA
Funding
National Institutes of Health (USA);
Keywords
High throughput screening; Under-sampling; Over-sampling; PubChem; Imbalanced classification; THROUGHPUT SCREENING DATA; RANDOM FOREST; LARGE SET; CLASSIFICATION; PREDICTION; MODEL; REGRESSION; SELECTION; COMPOUND; SMOTE;
DOI
10.1016/j.aca.2013.10.050
Chinese Library Classification number
O65 [Analytical Chemistry];
Discipline classification code
070302 ; 081704 ;
Abstract
High-throughput screening (HTS) commonly generates imbalanced datasets. When the imbalance is not taken into account, most classification methods achieve high predictive accuracy for the majority class but markedly poorer performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE), is developed and applied to several imbalanced datasets from PubChem BioAssay to overcome this problem. With the proposed combined method, the rare samples (active compounds), for which poor results are usually obtained, can be detected with high balanced accuracy (Gmean). For comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also used to classify the same datasets. Our results show that the former (GLMBoost + SMOTE) not only achieves higher performance, as measured by the percentage of correctly classified rare samples (Sensitivity) and by Gmean, but is also more computationally efficient than the latter (RF + SMOTE). We therefore hope that the proposed combined algorithm based on GLMBoost and SMOTE can be widely used to tackle the imbalanced classification problem. Published by Elsevier B.V.
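As a rough illustration of the two ideas named in the abstract, the sketch below shows SMOTE-style interpolation between minority samples and the Gmean metric, taken here in its usual sense as the geometric mean of sensitivity and specificity. This is not the authors' implementation: the function names `smote` and `sensitivity_specificity_gmean`, the parameter defaults, and the random choice of base sample are all assumptions made for the example, and the GLMBoost classifier itself is not shown.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating toward near neighbors.

    Simplified sketch: each synthetic point starts from a randomly chosen
    minority sample rather than iterating over every sample in turn.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base sample (Euclidean distance).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


def sensitivity_specificity_gmean(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, Gmean) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


if __name__ == "__main__":
    actives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(smote(actives, n_synthetic=3, k=2))
    print(sensitivity_specificity_gmean([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

In a workflow like the one the abstract describes, the synthetic samples would be added to the training set only, and Sensitivity and Gmean would be computed on untouched test data.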
Pages: 117-127
Page count: 11
Related papers
50 in total
  • [41] A synthetic informative minority over-sampling (SIMO) algorithm leveraging support vector machine to enhance learning from imbalanced datasets
    Piri, Saeed
    Delen, Dursun
    Liu, Tieming
    DECISION SUPPORT SYSTEMS, 2018, 106 : 15 - 29
  • [42] Deep Over-sampling Framework for Classifying Imbalanced Data
    Ando, Shin
    Huang, Chun Yuan
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2017, PT I, 2017, 10534 : 770 - 785
  • [43] A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets
    Rivera, William A.
    Xanthopoulos, Petros
    EXPERT SYSTEMS WITH APPLICATIONS, 2016, 66 : 124 - 135
  • [44] Over-sampling methods for mixed data in imbalanced problems
    Alonso, Hugo
    da Costa, Joaquim Fernando Pinto
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2024,
  • [45] Employee attrition prediction with convolutional neural network and synthetic minority over-sampling technique
    Duan, Lian
    Paknejad, Javad
    Kim, Hak
    JOURNAL OF BUSINESS ANALYTICS, 2025, 8 (01) : 24 - 35
  • [46] LVQ-SMOTE - Learning Vector Quantization based Synthetic Minority Over-sampling Technique for biomedical data
    Nakamura, Munehiro
    Kajiwara, Yusuke
    Otsuka, Atsushi
    Kimura, Haruhiko
    BIODATA MINING, 2013, 6
  • [47] On the Use of Surrounding Neighbors for Synthetic Over-Sampling of the Minority Class
    Garcia, V.
    Sanchez, J. S.
    Mollineda, R. A.
    SMO 08: PROCEEDINGS OF THE 8TH WSEAS INTERNATIONAL CONFERENCE ON SIMULATION, MODELLING AND OPTIMIZATION, 2008, : 389 - +
  • [48] Safe Level Graph for Synthetic Minority Over-sampling Techniques
    Bunkhumpornpat, Chumphol
    Subpaiboonkit, Sitthichoke
    2013 13TH INTERNATIONAL SYMPOSIUM ON COMMUNICATIONS AND INFORMATION TECHNOLOGIES (ISCIT): COMMUNICATION AND INFORMATION TECHNOLOGY FOR NEW LIFE STYLE BEYOND THE CLOUD, 2013, : 570 - 575
  • [49] Scholarship Recipients Prediction Model using k-Nearest Neighbor Algorithm and Synthetic Minority Over-sampling Technique
    Kurniadi, Dede
    Nuraeni, Fitri
    Abania, Nia
    Fitriani, Leni
    Mulyani, Asri
    Agustin, Yoga Handoko
    2022 12TH INTERNATIONAL CONFERENCE ON SYSTEM ENGINEERING AND TECHNOLOGY (ICSET 2022), 2022, : 89 - 94
  • [50] Over-Sampling Method on Imbalanced Data Based on WKMeans and SMOTE
    Chen, Junfeng
    Zheng, Zhongtuan
    Computer Engineering and Applications, 2024, 57 (23) : 106 - 112