An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling

Cited by: 83
Authors
Mansouri, K. [1, 2]
Grulke, C. M. [2]
Richard, A. M. [2]
Judson, R. S. [2]
Williams, A. J. [2]
Affiliations
[1] Oak Ridge Inst Sci & Educ ORISE, Oak Ridge, TN 37830 USA
[2] US EPA, Off Res & Dev, Natl Ctr Computat Toxicol, Res Triangle Pk, NC USA
Keywords
data curation; standardization; QSAR modelling; physicochemical properties; Open Data; MULTICRITERIA DECISION-MAKING; PARTITION-COEFFICIENTS; APPLICABILITY DOMAIN; OUTLIER DETECTION; PLS-REGRESSION; VALIDATION; PREDICTION; SELECTION; CHEMINFORMATICS; ALGORITHMS;
DOI
10.1080/1062936X.2016.1253611
Chinese Library Classification (CLC) number
O6 [Chemistry]
Subject classification code
0703
Abstract
The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest-quality subset of the original dataset was compared with the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publicly available for further usage and integration by the scientific community.
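One identifier-level check of the kind the abstract alludes to can be sketched in a few lines. The example below validates the check digit of a CASRN and separates records that pass from those needing review. It is an illustration only: the published workflow is built in KNIME, and the abstract does not say that a check-digit test is one of its steps. The function names `casrn_is_valid` and `split_by_casrn`, and the record layout with `name` and `casrn` keys, are hypothetical.

```python
import re

# CASRN layout: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CASRN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def casrn_is_valid(casrn: str) -> bool:
    """Return True if the string is CASRN-shaped and its check digit is correct."""
    match = CASRN_PATTERN.match(casrn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one,
    # then compare the weighted sum modulo 10 with the check digit.
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def split_by_casrn(records):
    """Hypothetical helper: separate records with a valid CASRN from the rest.

    Each record is assumed to be a dict with 'name' and 'casrn' keys.
    """
    valid, review = [], []
    for record in records:
        target = valid if casrn_is_valid(record.get("casrn", "")) else review
        target.append(record)
    return valid, review


if __name__ == "__main__":
    sample = [
        {"name": "water", "casrn": "7732-18-5"},
        {"name": "ethanol", "casrn": "64-17-4"},  # deliberately wrong check digit
    ]
    ok, flagged = split_by_casrn(sample)
    print([r["name"] for r in ok], [r["name"] for r in flagged])
```

A check like this only catches malformed or mistyped registry numbers. It cannot tell whether a well-formed CASRN actually belongs to the accompanying name or structure, which is the cross-identifier comparison the abstract describes.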
Pages: 911-937
Page count: 27