QSAR Modeling of Imbalanced High-Throughput Screening Data in PubChem

Times cited: 93
Authors
Zakharov, Alexey V. [1]
Peach, Megan L. [2]
Sitzmann, Markus [1]
Nicklaus, Marc C. [1]
Affiliations
[1] NCI, CADD Grp, Biol Chem Lab, Ctr Canc Res, NIH, DHHS, NCI Frederick, 376 Boyles St, Frederick, MD 21702 USA
[2] Frederick Natl Lab Canc Res, Basic Sci Program, Leidos Biomed Inc, Comp Aided Drug Design Grp, Chem Biol Lab, Frederick, MD 21702 USA
Funding
National Institutes of Health (NIH), USA
Keywords
PIPELINE PILOT; RANDOM FOREST; PREDICTION;
DOI
10.1021/ci400737s
Chinese Library Classification (CLC) number
R914 [Medicinal Chemistry]
Discipline classification code
100701
Abstract
Many of the structures in PubChem are annotated with activities determined in high-throughput screening (HTS) assays. Because of the nature of these assays, the activity data are typically strongly imbalanced, with a small number of active compounds contrasting with a very large number of inactive compounds. We have used several such imbalanced PubChem HTS assays to test and develop strategies to efficiently build robust QSAR models from imbalanced data sets. Different descriptor types [Quantitative Neighborhoods of Atoms (QNA) and "biological" descriptors] were used to generate a variety of QSAR models in the program GUSAR. The models obtained were compared using external test and validation sets. We also report on our efforts to incorporate the most predictive of our models in the publicly available NCI/CADD Group Web services (http://cactus.nci.nih.gov/chemical/apps/cap).
Pages: 705-712
Page count: 8