Data mining a small molecule drug screening representative subset from NIH PubChem

Times cited: 58
Authors
Xie, Xiang-Qun [1 ,2 ,3 ]
Chen, Jian-Zhong [1 ]
Affiliations
[1] Univ Pittsburgh, Sch Pharm, Dept Pharmaceut Sci, Pittsburgh Mol Lib Screening Ctr,Drug Discovery I, Pittsburgh, PA 15260 USA
[2] Univ Pittsburgh, Dept Computat Biol, Pittsburgh, PA 15260 USA
[3] Univ Pittsburgh, Dept Biol Struct, Pittsburgh, PA 15260 USA
Keywords
DOI
10.1021/ci700193u
Chinese Library Classification
R914 [Medicinal chemistry]
Discipline classification code
100701
Abstract
PubChem is a scientific showcase of the NIH Roadmap Initiatives. It is a compound repository created to facilitate information exchange and data sharing among the NIH Roadmap-funded Molecular Library Screening Center Network (MLSCN) and the scientific community. However, PubChem holds more than 10 million compound records, which makes screening the whole database for drug candidates challenging. The purpose of the present study was therefore to develop a data-mining cheminformatics approach for constructing a representative, structurally diverse sublibrary from the large PubChem database. A new chemically diverse representative subset, rePubChem, was selected by whole-molecule chemistry-space matrix calculation using the cell-based partition algorithm. The subset was then evaluated by compound property analyses based on 1D and 2D molecular descriptors, and its self-similarity was assessed with 2D molecular fingerprints in comparison with the source compound library. The new subset is much smaller (540K compounds) and has minimal similarity and redundancy, yet it retains the structural diversity and basic molecular properties of its parent library (5.3 million compounds). It could be a valuable, structurally diverse compound resource for in silico virtual screening and in vitro HTS drug screening. In addition, the established subset generation method, which combines cell-based chemistry-space partition metrics with pairwise 2D fingerprint-based similarity searching, should be useful to a broad scientific community interested in acquiring structurally diverse compounds for efficient drug screening, building representative virtual combinatorial chemistry libraries for synthesis, and data mining large compound databases such as PubChem in general.
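The two ideas named in the abstract, cell-based partitioning of a descriptor space and pairwise fingerprint similarity, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the equal-width binning, the "closest to cell center" selection rule, the Tanimoto coefficient, and the toy data are all assumptions made for illustration, and a real workflow would compute descriptors and fingerprints with a cheminformatics toolkit.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple


def cell_index(coords: Sequence[float], lows: Sequence[float],
               highs: Sequence[float], bins: int) -> Tuple[int, ...]:
    """Map a point in descriptor space to the index of its equal-width cell."""
    idx = []
    for x, lo, hi in zip(coords, lows, highs):
        if hi == lo:                      # degenerate axis: a single cell
            idx.append(0)
            continue
        k = int((x - lo) / (hi - lo) * bins)
        idx.append(min(max(k, 0), bins - 1))  # clamp the upper edge into the last bin
    return tuple(idx)


def cell_based_subset(points: Dict[str, Sequence[float]], bins: int) -> List[str]:
    """Keep one representative per occupied cell (assumed rule: nearest to cell center)."""
    dims = len(next(iter(points.values())))
    lows = [min(p[d] for p in points.values()) for d in range(dims)]
    highs = [max(p[d] for p in points.values()) for d in range(dims)]

    cells: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for name, coords in points.items():
        cells[cell_index(coords, lows, highs, bins)].append(name)

    chosen = []
    for idx, members in sorted(cells.items()):
        center = [lo + (k + 0.5) * (hi - lo) / bins
                  for k, lo, hi in zip(idx, lows, highs)]

        def dist2(name: str) -> float:
            return sum((x - c) ** 2 for x, c in zip(points[name], center))

        # Ties are broken by name so the result is deterministic.
        chosen.append(min(members, key=lambda n: (dist2(n), n)))
    return chosen


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    if not a and not b:
        return 0.0                        # convention chosen for this sketch
    return len(a & b) / len(a | b)


def redundant_pairs(fps: Dict[str, Set[int]], threshold: float) -> List[Tuple[str, str]]:
    """List compound pairs whose fingerprint similarity is at or above the threshold."""
    return [(m, n) for m, n in combinations(sorted(fps), 2)
            if tanimoto(fps[m], fps[n]) >= threshold]


if __name__ == "__main__":
    # Hypothetical two-descriptor coordinates for five toy compounds.
    toy_points = {"A": (0.0, 0.0), "B": (0.1, 0.1), "C": (1.0, 1.0),
                  "D": (0.9, 0.95), "E": (0.0, 1.0)}
    subset = cell_based_subset(toy_points, bins=2)
    print("representatives:", subset)

    # Hypothetical fingerprints for the selected compounds.
    toy_fps = {"B": {1, 2, 3}, "E": {2, 3, 4}, "D": {7, 8}}
    print("similar pairs:", redundant_pairs(toy_fps, threshold=0.5))
```

The number of bins per axis controls the trade-off in this sketch: fewer bins yield a smaller subset with coarser coverage of the descriptor space, while the similarity check gives an independent view of how much redundancy remains among the selected compounds.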
Pages: 465-475
Number of pages: 11