QSAR Modeling of Imbalanced High-Throughput Screening Data in PubChem

Cited by: 93

Authors:

Zakharov, Alexey V. ^{[1]}

Peach, Megan L. ^{[2]}

Sitzmann, Markus ^{[1]}

Nicklaus, Marc C. ^{[1]}

Affiliations:

[1] NCI, CADD Grp, Biol Chem Lab, Ctr Canc Res, NIH, DHHS, NCI Frederick, 376 Boyles St, Frederick, MD 21702 USA

[2] Frederick Natl Lab Canc Res, Basic Sci Program, Leidos Biomed Inc, Comp Aided Drug Design Grp, Chem Biol Lab, Frederick, MD 21702 USA

Source:

JOURNAL OF CHEMICAL INFORMATION AND MODELING | 2014 / Vol. 54 / Issue 03

Funding:

U.S. National Institutes of Health;

Keywords:

PIPELINE PILOT; RANDOM FOREST; PREDICTION;

DOI:

10.1021/ci400737s

Chinese Library Classification (CLC) number:

R914 [Medicinal Chemistry];

Discipline classification code:

100701;

Abstract:

Many of the structures in PubChem are annotated with activities determined in high-throughput screening (HTS) assays. Because of the nature of these assays, the activity data are typically strongly imbalanced, with a small number of active compounds contrasting with a very large number of inactive compounds. We have used several such imbalanced PubChem HTS assays to test and develop strategies to efficiently build robust QSAR models from imbalanced data sets. Different descriptor types [Quantitative Neighborhoods of Atoms (QNA) and "biological" descriptors] were used to generate a variety of QSAR models in the program GUSAR. The models obtained were compared using external test and validation sets. We also report on our efforts to incorporate the most predictive of our models in the publicly available NCI/CADD Group Web services (http://cactus.nci.nih.gov/chemical/apps/cap).

Citation

Pages: 705-712

Page count: 8

