Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms

Cited by: 11
Authors
Hussin, Sahar K. [1]
Abdelmageid, Salah M. [2]
Alkhalil, Adel [3]
Omar, Yasser M. [4]
Marie, Mahmoud I. [5]
Ramadan, Rabie A. [3, 6]
Affiliations
[1] Alshrouck Acad, Commun & Comp Engn Dept, Cairo, Egypt
[2] Taibah Univ, Comp Engn Dept, Coll Comp Sci & Engn, Medina, Saudi Arabia
[3] Univ Hail, Coll Comp Sci & Engn, Hail, Saudi Arabia
[4] Arab Acad Sci Technol & Maritime Transport, Cairo, Egypt
[5] Al Azhar Univ, Comp & Syst Engn Dept, Cairo, Egypt
[6] Cairo Univ, Comp Engn Dept, Cairo, Egypt
Keywords
K-means clustering;
DOI
10.1155/2021/6675279
Chinese Library Classification (CLC) number
O1 [Mathematics];
Discipline classification code
0701; 070101;
Abstract
Virtual screening is the most critical process in drug discovery, and it relies on machine learning to facilitate screening. It enables the discovery of molecules that bind to a specific protein to form a drug. Despite its benefits, virtual screening generates enormous volumes of data and suffers from drawbacks such as high dimensionality and class imbalance. This paper tackles data imbalance and aims to improve virtual screening accuracy, especially for the minority class. When a classifier is trained without accounting for the imbalanced nature of the data, most classification methods achieve high predictive accuracy for the majority class but significantly poorer accuracy for the minority class. The paper proposes combining the K-means algorithm with the Synthetic Minority Oversampling Technique (SMOTE) to overcome the problem of imbalanced datasets. The proposed algorithm is named KSMOTE. Using KSMOTE, minority data can be identified with high accuracy and detected with high precision. A large set of experiments was run on Apache Spark using numeric PaDEL and fingerprint descriptors. The proposed solution was compared with both a no-sampling baseline and SMOTE on the same datasets. Experimental results showed that the proposed solution outperformed the other methods.
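The abstract does not spell out how K-means and SMOTE are combined in KSMOTE, so the following is only an illustrative sketch under one plausible assumption: the minority class is first split into clusters with K-means, and SMOTE-style interpolation is then applied inside each cluster until the minority count matches the majority count. The function names (`kmeans`, `smote`, `cluster_then_smote`), the toy data, and all parameter values are hypothetical, and the code is plain Python rather than the Apache Spark setup used in the paper.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def smote(samples, n_new, k_neighbors=3, seed=0):
    """SMOTE-style oversampling: interpolate between a sample and a near neighbor."""
    rng = random.Random(seed)
    if len(samples) < 2:
        return []  # no neighbor to interpolate with
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        neighbors = sorted((s for s in samples if s is not base),
                           key=lambda s: math.dist(base, s))[:k_neighbors]
        mate = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


def cluster_then_smote(minority, n_majority, k=2, seed=0):
    """Assumed combination: cluster the minority class, then oversample per cluster."""
    labels = kmeans(minority, k, seed=seed)
    clusters = [[p for p, lab in zip(minority, labels) if lab == c]
                for c in range(k)]
    clusters = [c for c in clusters if len(c) >= 2]  # need a neighbor to interpolate
    needed = max(0, n_majority - len(minority))
    balanced = list(minority)
    for i, members in enumerate(clusters):
        # Spread the required synthetic samples evenly over usable clusters.
        share = needed // len(clusters) + (1 if i < needed % len(clusters) else 0)
        balanced.extend(smote(members, share, seed=seed + i))
    return balanced


# Toy minority class: two small groups of 2-D descriptor vectors.
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
balanced = cluster_then_smote(minority, n_majority=20)
print(len(balanced))
```

Interpolating only within a cluster keeps each synthetic sample between real neighbors of the same group, which is the usual motivation for pairing clustering with SMOTE; whether KSMOTE does exactly this is not stated in the abstract.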
Pages: 15