A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications

Cited by: 50
Authors
Gadaleta, Domenico [1]
Lombardo, Anna [1]
Toma, Cosimo [1]
Benfenati, Emilio [1]
Affiliations
[1] IRCCS, Ist Ric Farmacol Mario Negri, Dept Environm Hlth Sci, Lab Environm Chem & Toxicol, Via Masa 19, I-20156 Milan, Italy
Source
JOURNAL OF CHEMINFORMATICS | 2018 / Vol. 10
Keywords
QSAR; Data curation; Data cleaning; Semi-automated; Workflow; CURATION;
DOI
10.1186/s13321-018-0315-6
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
The quality of the data used to derive QSAR models is extremely important, as it strongly affects the final robustness and predictive power of the model. Ambiguous or wrong structures need to be carefully checked, because they lead to errors in the calculation of descriptors and hence to meaningless results. The increasing amount of data, however, has often made it hard to check very large databases manually. In light of this, we designed and implemented a semi-automated workflow integrating structural data retrieval from several web-based databases, automated comparison of these data, chemical structure cleaning, and selection and standardization of data into a consistent, ready-to-use format that can be employed for modeling. The workflow integrates best practices for data curation that have been suggested in the recent literature. It has been implemented with the freely available KNIME software and is freely available to the cheminformatics community for improvement and application to a broad range of chemical datasets.
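The "automated comparison" step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' KNIME workflow: the function name `consensus_structure`, the status labels, and the source names `source_a`, `source_b`, `source_c` are hypothetical, and the sketch assumes each database returns a canonical identifier (for example an InChIKey) so that plain string equality is a meaningful comparison.

```python
from collections import Counter


def consensus_structure(records):
    """Compare the identifiers retrieved for one chemical from several sources.

    `records` maps a (hypothetical) source name to the canonical identifier
    it returned, or to None when the source had no entry.
    Returns a dict with a status label and the agreed identifier, if any.
    """
    # Keep only sources that actually returned something.
    found = {src: ident.strip() for src, ident in records.items()
             if ident and ident.strip()}
    if not found:
        return {"status": "missing", "structure": None}

    counts = Counter(found.values())
    top_ident, top_votes = counts.most_common(1)[0]

    if len(found) == 1:
        status = "single_source"   # nothing to compare against
    elif len(counts) == 1:
        status = "consistent"      # every source agrees
    elif top_votes > len(found) / 2:
        status = "majority"        # most sources agree, one or more dissent
    else:
        status = "conflict"        # no clear winner: needs manual checking
        top_ident = None
    return {"status": status, "structure": top_ident}


# Illustrative input: two sources return the InChIKey of ethanol,
# a third returns the InChIKey of methanol.
ETHANOL = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"
METHANOL = "OKKJLVBELUTLKV-UHFFFAOYSA-N"
result = consensus_structure(
    {"source_a": ETHANOL, "source_b": ETHANOL, "source_c": METHANOL}
)
print(result["status"], result["structure"])
```

Records labelled `conflict` or `missing` are the ones a semi-automated procedure would hand back to a human curator, while `consistent` records could pass straight to structure cleaning and standardization.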
Pages: 13