An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data

Citations: 54

Authors:

Hao, Ming [1]

Wang, Yanli [1]

Bryant, Stephen H. [1]

Affiliations:

[1] NIH, Natl Ctr Biotechnol Informat, Natl Lib Med, Bethesda, MD 20894 USA

Source:

ANALYTICA CHIMICA ACTA | 2014 / Volume 806

Funding:

National Institutes of Health (USA);

Keywords:

High throughput screening; Under-sampling; Over-sampling; PubChem; Imbalanced classification; THROUGHPUT SCREENING DATA; RANDOM FOREST; LARGE SET; CLASSIFICATION; PREDICTION; MODEL; REGRESSION; SELECTION; COMPOUND; SMOTE;

DOI:

10.1016/j.aca.2013.10.050

Chinese Library Classification (CLC) number:

O65 [Analytical Chemistry];

Discipline classification code:

070302 ; 081704 ;

Abstract:

High-throughput screening (HTS) often generates imbalanced datasets. When the imbalance is not taken into account, most classification methods tend to produce high predictive accuracy for the majority class but significantly poorer performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE), is developed and applied to overcome this problem for several imbalanced datasets from PubChem BioAssay. With the proposed combined method, rare samples (active compounds), for which results are usually poor, can be detected with high balanced accuracy (Gmean). For comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also used to classify the same datasets. Our results show that the former (GLMBoost + SMOTE) not only exhibits higher performance, as measured by the percentage of correctly classified rare samples (Sensitivity) and by Gmean, but also demonstrates greater computational efficiency than the latter (RF + SMOTE). Therefore, we hope that the proposed combined algorithm based on GLMBoost and SMOTE can be widely used to tackle the imbalanced classification problem. Published by Elsevier B.V.
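The two ideas the abstract relies on, SMOTE over-sampling and the Gmean metric, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smote` and `sensitivity_specificity_gmean`, the default `k=5`, and the `seed` parameter are assumptions made for illustration, and Gmean is taken here in its common form as the geometric mean of sensitivity and specificity. GLMBoost itself is not shown.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Sketch of SMOTE: create synthetic minority samples by interpolating
    between a minority point and one of its k nearest minority neighbors.
    (Hypothetical helper for illustration, not the paper's code.)"""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


def sensitivity_specificity_gmean(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, gmean), assuming
    gmean = sqrt(sensitivity * specificity)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)
```

In a workflow of this kind, `smote` would be applied only to the training-set active compounds before fitting a classifier, and `sensitivity_specificity_gmean` would be computed on untouched test data. A classifier that labels everything inactive scores high plain accuracy on imbalanced data but a Gmean of zero, which is why the metric suits this setting.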

Pages: 117-127

Page count: 11
