An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data

Cited by: 54
Authors
Hao, Ming [1]
Wang, Yanli [1]
Bryant, Stephen H. [1]
Affiliations
[1] NIH, Natl Ctr Biotechnol Informat, Natl Lib Med, Bethesda, MD 20894 USA
Funding
National Institutes of Health (USA)
Keywords
High throughput screening; Under-sampling; Over-sampling; PubChem; Imbalanced classification; THROUGHPUT SCREENING DATA; RANDOM FOREST; LARGE SET; CLASSIFICATION; PREDICTION; MODEL; REGRESSION; SELECTION; COMPOUND; SMOTE;
DOI
10.1016/j.aca.2013.10.050
Chinese Library Classification
O65 [Analytical Chemistry]
Discipline classification codes
070302; 081704
Abstract
High-throughput screening (HTS) often generates imbalanced datasets. When the imbalance is not taken into account, most classification methods tend to produce high predictive accuracy for the majority class but significantly poorer performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE), is developed and used to overcome this problem for several imbalanced datasets from PubChem BioAssay. With the proposed combined method, the rare samples (active compounds), for which poor results are usually obtained, can be detected with high balanced accuracy (Gmean). For comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also applied to classify the same datasets. Our results show that the former (GLMBoost + SMOTE) not only exhibits higher performance, as measured by the percentage of correctly classified rare samples (Sensitivity) and by Gmean, but also demonstrates greater computational efficiency than the latter (RF + SMOTE). Therefore, we hope that the proposed combined algorithm based on GLMBoost and SMOTE can be widely used to tackle the imbalanced classification problem. Published by Elsevier B.V.
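The two ideas the abstract relies on, SMOTE over-sampling and the Gmean metric, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `smote` and `sensitivity_specificity_gmean` are hypothetical, the toy data are made up, and Gmean is assumed to follow the common definition, the square root of sensitivity times specificity. GLMBoost and Random Forest themselves are not shown.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples (illustrative sketch only).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def sensitivity_specificity_gmean(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, Gmean) for binary labels.

    Assumes Gmean = sqrt(sensitivity * specificity).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec, math.sqrt(sens * spec)


# Toy "active compound" descriptors (made-up values, two features each).
actives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(actives, n_new=4, k=2)

# One of two actives found, one of four inactives misclassified:
# sensitivity 0.5, specificity 0.75.
sens, spec, g = sensitivity_specificity_gmean([1, 1, 0, 0, 0, 0],
                                              [1, 0, 0, 0, 0, 1])

# A classifier that labels everything inactive on a 5:95 dataset is 95%
# accurate but has zero sensitivity, so its Gmean is 0.
_, _, g_all_inactive = sensitivity_specificity_gmean([1] * 5 + [0] * 95,
                                                     [0] * 100)
```

The last case shows why a metric such as Gmean is preferred over plain accuracy for imbalanced screening data: it rewards a model only when both classes are predicted well.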
Pages: 117-127
Number of pages: 11
Related papers
50 in total
  • [31] Noise Reduction A Priori Synthetic Over-Sampling for class imbalanced data sets
    Rivera, William A.
    INFORMATION SCIENCES, 2017, 408 : 146 - 161
  • [32] A topological approach for mammographic density classification using a modified synthetic minority over-sampling technique algorithm
    Nedjar, Imane
    Mahmoudi, Said
    Chikh, Mohamed Amine
    INTERNATIONAL JOURNAL OF BIOMEDICAL ENGINEERING AND TECHNOLOGY, 2022, 38 (02) : 193 - 214
  • [33] A hybrid sampling algorithm combining synthetic minority over-sampling technique and edited nearest neighbor for missed abortion diagnosis
    Yang, Fangyuan
    Wang, Kang
    Sun, Lisha
    Zhai, Mengjiao
    Song, Jiejie
    Wang, Hong
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2022, 22 (01)
  • [34] A hybrid sampling algorithm combining synthetic minority over-sampling technique and edited nearest neighbor for missed abortion diagnosis
    Fangyuan Yang
    Kang Wang
    Lisha Sun
    Mengjiao Zhai
    Jiejie Song
    Hong Wang
    BMC Medical Informatics and Decision Making, 22
  • [35] Multi-fidelity model based on synthetic minority over-sampling technique
    Jiuxiang Song
    Jizhong Liu
    Multimedia Tools and Applications, 2024, 83 : 33123 - 33139
  • [36] Classification of Advertisement Text on Facebook Using Synthetic Minority Over-Sampling Technique
    Akkaradamrongrat, Suphamongkol
    Kachamas, Pornpimon
    Sinthupinyo, Sukree
    2018 INTERNATIONAL CONFERENCE ON ALGORITHMS, COMPUTING AND ARTIFICIAL INTELLIGENCE (ACAI 2018), 2018
  • [37] An Over-Sampling Technique with Rejection for Imbalanced Class Learning
    Lee, Jaedong
    Kim, Noo-ri
    Lee, Jee-Hyong
    ACM IMCOM 2015, PROCEEDINGS, 2015
  • [38] Multi-fidelity model based on synthetic minority over-sampling technique
    Song, Jiuxiang
    Liu, Jizhong
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (11) : 33123 - 33139
  • [39] Over-Sampling Algorithm Based on VAE in Imbalanced Classification
    Zhang, Chunkai
    Zhou, Ying
    Chen, Yingyang
    Deng, Yepeng
    Wang, Xuan
    Dong, Lifeng
    Wei, Haoyu
    CLOUD COMPUTING - CLOUD 2018, 2018, 10967 : 334 - 344
  • [40] Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem
    Bunkhumpornpat, Chumphol
    Sinapiromsaran, Krung
    Lursinsap, Chidchanok
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 2009, 5476 : 475 - 482