Development of a Novel Fingerprint for Chemical Reactions and Its Application to Large-Scale Reaction Classification and Similarity

Cited by: 132
Authors
Schneider, Nadine [1]
Lowe, Daniel M. [2]
Sayle, Roger A. [1]
Landrum, Gregory A. [1]
Affiliations
[1] Novartis Inst BioMed Res, CH-4002 Basel, Switzerland
[2] NextMove Software Ltd, Innovat Ctr, Unit 23, Cambridge CB4 0EY, England
Keywords
KNOWLEDGE; METHODOLOGY; PREDICTION; DESIGN; SYSTEM; BASE
DOI
10.1021/ci5006614
Chinese Library Classification (CLC) number
R914 [Medicinal chemistry]
Subject classification code
100701
Abstract
Fingerprint methods applied to molecules have proven to be useful for similarity determination and as inputs to machine-learning models. Here, we present the development of a new fingerprint for chemical reactions and validate its usefulness in building machine-learning models and in similarity assessment. Our final fingerprint is constructed as the difference of the atom-pair fingerprints of products and reactants and includes agents via calculated physicochemical properties. We validated the fingerprints on a large data set of reactions text-mined from granted United States patents from the last 40 years that have been classified using a substructure-based expert system. We applied machine learning to build a 50-class predictive model for reaction-type classification that correctly predicts 97% of the reactions in an external test set. Impressive accuracies were also observed when applying the classifier to reactions from an in-house electronic laboratory notebook. The performance of the novel fingerprint for assessing reaction similarity was evaluated by a cluster analysis that recovered 48 out of 50 of the reaction classes with a median F-score of 0.63 for the clusters. The data sets used for training and primary validation as well as all Python scripts required to reproduce the analysis are provided in the Supporting Information.
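The abstract describes the fingerprint as the difference between the atom-pair fingerprints of products and reactants. The sketch below illustrates that idea in plain Python and is not the authors' implementation. It rests on several simplifying assumptions: atoms are typed by element symbol only (the original atom-pair definition uses richer atom types), bond orders and hydrogens are ignored, features are kept as unhashed counts, and the agent contribution mentioned in the abstract is omitted. The function names `atom_pair_counts` and `reaction_difference_fp` and the tuple-based molecule encoding are hypothetical, chosen for this example.

```python
from collections import Counter, deque


def atom_pair_counts(atoms, bonds):
    """Count (type_a, topological_distance, type_b) features for one molecule.

    atoms: list of element symbols for heavy atoms, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom-index pairs (bond order is ignored here)
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    counts = Counter()
    for start in range(len(atoms)):
        # Breadth-first search yields shortest-path distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in neighbors[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        for end, d in dist.items():
            if end > start:  # count each unordered atom pair once
                a, b = sorted((atoms[start], atoms[end]))
                counts[(a, d, b)] += 1
    return counts


def reaction_difference_fp(reactants, products):
    """Sum of product atom-pair counts minus sum of reactant counts."""
    diff = Counter()
    for atoms, bonds in products:
        diff.update(atom_pair_counts(atoms, bonds))
    for atoms, bonds in reactants:
        diff.subtract(atom_pair_counts(atoms, bonds))
    # Features that cancel out carry no information about the change.
    return {feature: n for feature, n in diff.items() if n != 0}


# Toy esterification: acetic acid + methanol -> methyl acetate + water
acetic_acid = (["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
methanol = (["C", "O"], [(0, 1)])
methyl_acetate = (["C", "C", "O", "O", "C"], [(0, 1), (1, 2), (1, 3), (3, 4)])
water = (["O"], [])

fp = reaction_difference_fp(
    reactants=[acetic_acid, methanol],
    products=[methyl_acetate, water],
)
```

In this toy case every atom pair that lies entirely within an unchanged part of the molecules cancels, so `fp` keeps only the pairs that span the newly formed ester linkage. This cancellation is the reason a difference of counts can characterize the reaction center rather than the molecules as a whole.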
Pages: 39-53
Page count: 15