An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data

Cited by: 54
Authors
Hao, Ming [1]
Wang, Yanli [1]
Bryant, Stephen H. [1]
Affiliations
[1] NIH, Natl Ctr Biotechnol Informat, Natl Lib Med, Bethesda, MD 20894 USA
Funding
National Institutes of Health (USA);
Keywords
High throughput screening; Under-sampling; Over-sampling; PubChem; Imbalanced classification; THROUGHPUT SCREENING DATA; RANDOM FOREST; LARGE SET; CLASSIFICATION; PREDICTION; MODEL; REGRESSION; SELECTION; COMPOUND; SMOTE;
DOI
10.1016/j.aca.2013.10.050
Chinese Library Classification number
O65 [Analytical Chemistry];
Discipline classification code
070302 ; 081704 ;
Abstract
High-throughput screening (HTS) commonly generates imbalanced datasets. When this imbalance is not taken into account, most classification methods achieve high predictive accuracy for the majority class but significantly poorer performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE), is developed and applied to overcome this problem for several imbalanced datasets from PubChem BioAssay. With the proposed combined method, the rare samples (active compounds), which usually yield poor results, can be detected with high balanced accuracy (Gmean). For comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also used to classify the same datasets. Our results show that the former (GLMBoost + SMOTE) not only exhibits higher performance, as measured by the percentage of correctly classified rare samples (Sensitivity) and by Gmean, but also demonstrates greater computational efficiency than the latter (RF + SMOTE). Therefore, we hope that the proposed combined algorithm based on GLMBoost and SMOTE can be widely used to tackle the imbalanced classification problem. Published by Elsevier B.V.
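As a rough illustration of two ideas named in the abstract, the sketch below shows how SMOTE-style over-sampling creates synthetic minority samples by interpolation, and how Sensitivity and Gmean can be computed from predictions. It is not the authors' implementation: the function names `smote_sketch` and `sensitivity_specificity_gmean`, the parameters `k`, `n_new`, and `seed`, and the toy data are all assumptions made for this example. The base sample is drawn at random here, a simplification of the usual per-sample loop, and Gmean is taken to be the geometric mean of Sensitivity and Specificity, the common definition. GLMBoost itself is not reproduced.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors.

    Assumption: each synthetic point lies on the line segment between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (Euclidean distance). Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # relative position along base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def sensitivity_specificity_gmean(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, gmean) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec, math.sqrt(sens * spec)


# Toy data (hypothetical): 2 actives among 10 compounds.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_inactive = [0] * 10  # 80% accurate, yet it finds no actives
sens0, spec0, gmean0 = sensitivity_specificity_gmean(y_true, always_inactive)

y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
sens1, spec1, gmean1 = sensitivity_specificity_gmean(y_true, y_pred)

actives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_sketch(actives, n_new=4, k=2, seed=42)
```

The always-inactive predictor shows why plain accuracy is misleading on imbalanced data: it scores 80% accuracy, but its Sensitivity and Gmean are both zero.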
Pages: 117 - 127
Page count: 11
Related papers
50 in total
  • [21] DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique
    Chumphol Bunkhumpornpat
    Krung Sinapiromsaran
    Chidchanok Lursinsap
    Applied Intelligence, 2012, 36 : 664 - 684
  • [22] Diversity and Separable Metrics in Over-Sampling Technique for Imbalanced Data Classification
    Mahmoudi, Shadi
    Moradi, Parham
    Akhlaghian, Fardin
    Moradi, Rizan
    2014 4TH INTERNATIONAL CONFERENCE ON COMPUTER AND KNOWLEDGE ENGINEERING (ICCKE), 2014, : 152 - 158
  • [23] A New Over-sampling Technique Based on SVM for Imbalanced Diseases Data
    Wang, Jinjin
    Yao, Yukai
    Zhou, Hanhai
    Leng, Mingwei
    Chen, Xiaoyun
    PROCEEDINGS 2013 INTERNATIONAL CONFERENCE ON MECHATRONIC SCIENCES, ELECTRIC ENGINEERING AND COMPUTER (MEC), 2013, : 1224 - 1228
  • [24] SYNTHETIC MINORITY OVER-SAMPLING TECHNIQUE BASED ROTATION FOREST FOR THE CLASSIFICATION OF UNBALANCED HYPERSPECTRAL DATA
    Feng, Wei
    Huang, Wenjiang
    Ye, Huichun
    Zhao, Longlong
    IGARSS 2018 - 2018 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2018, : 2651 - 2654
  • [25] A clustered borderline synthetic minority over-sampling technique for balancing quick access recorder data
    Li, Kunpeng
    Xu, Junjie
    Zhao, Huimin
    Deng, Wu
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2023, 45 (04) : 6849 - 6862
  • [26] Learning from Imbalanced Data Using Over-Sampling and the Firefly Algorithm
    Czarnowski, Ireneusz
    COMPUTATIONAL COLLECTIVE INTELLIGENCE (ICCCI 2021), 2021, 12876 : 373 - 386
  • [27] Predicting Seminal Quality via Imbalanced Learning with Evolutionary Safe-Level Synthetic Minority Over-Sampling Technique
    Jieming Ma
    David Olalekan Afolabi
    Jie Ren
    Aiyan Zhen
    Cognitive Computation, 2021, 13 : 833 - 844
  • [28] Predicting Seminal Quality via Imbalanced Learning with Evolutionary Safe-Level Synthetic Minority Over-Sampling Technique
    Ma, Jieming
    Afolabi, David Olalekan
    Ren, Jie
    Zhen, Aiyan
    COGNITIVE COMPUTATION, 2021, 13 (04) : 833 - 844
  • [29] Neighborhood Triangular Synthetic Minority Over-sampling Technique for Imbalanced Prediction on Small Samples of Chinese Tourism and Hospitality Firms
    Xu, Yu-Hui
    Li, Hui
    Le, Lu-Ping
    Tian, Xiao-Yun
    2014 SEVENTH INTERNATIONAL JOINT CONFERENCE ON COMPUTATIONAL SCIENCES AND OPTIMIZATION (CSO), 2014, : 534 - 538
  • [30] Cluster-Based Minority Over-Sampling for Imbalanced Datasets
    Puntumapon, Kamthorn
    Rakthamamon, Thanawin
    Waiyamai, Kitsana
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2016, E99D (12) : 3101 - 3109