An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data

Citations: 54
Authors
Hao, Ming [1]
Wang, Yanli [1]
Bryant, Stephen H. [1]
Affiliations
[1] NIH, Natl Ctr Biotechnol Informat, Natl Lib Med, Bethesda, MD 20894 USA
Funding
National Institutes of Health (USA);
Keywords
High throughput screening; Under-sampling; Over-sampling; PubChem; Imbalanced classification; THROUGHPUT SCREENING DATA; RANDOM FOREST; LARGE SET; CLASSIFICATION; PREDICTION; MODEL; REGRESSION; SELECTION; COMPOUND; SMOTE;
DOI
10.1016/j.aca.2013.10.050
Chinese Library Classification number
O65 [Analytical Chemistry];
Discipline classification codes
070302; 081704;
Abstract
High-throughput screening (HTS) commonly generates imbalanced datasets. When the imbalance is not taken into account, most classification methods achieve high predictive accuracy for the majority class but significantly poorer performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE), is developed and used to overcome this problem for several imbalanced datasets from PubChem BioAssay. With the proposed combined method, the rare samples (active compounds), for which poor results are usually obtained, can be detected with high balanced accuracy (Gmean). For comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also used to classify the same datasets. Our results show that the former (GLMBoost + SMOTE) not only exhibits higher performance, as measured by the percentage of correctly classified rare samples (Sensitivity) and by Gmean, but is also more computationally efficient than the latter (RF + SMOTE). We therefore hope that the proposed combination of GLMBoost and SMOTE can be widely used to tackle the imbalanced classification problem. Published by Elsevier B.V.
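The two ingredients the abstract relies on, SMOTE over-sampling and evaluation by Sensitivity and Gmean, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `smote_sketch` and `gmean_report`, the parameters `k` and `seed`, and the toy data are hypothetical. It also assumes the usual definition of Gmean as the geometric mean of sensitivity and specificity, which the abstract does not spell out. The GLMBoost and Random Forest classifiers themselves are not shown.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic minority samples.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours. This is a
    simplified illustration of the SMOTE idea, not a full implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen sample (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def gmean_report(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, gmean) for binary labels.

    Assumption: gmean = sqrt(sensitivity * specificity).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


if __name__ == "__main__":
    # Toy "active compound" descriptors (hypothetical 2-D feature vectors).
    actives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(len(smote_sketch(actives, n_new=10, k=2)))
    print(gmean_report([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

Because Gmean multiplies the two class-wise rates, a classifier that labels every compound inactive scores zero regardless of its overall accuracy, which is why it suits imbalanced screening data better than plain accuracy.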
Pages: 117-127
Page count: 11