Impact of benchmark data set topology on the validation of virtual screening methods: Exploration and quantification by spatial statistics

Cited by: 27
Authors
Rohrer, Sebastian G. [1 ]
Baumann, Knut [1 ]
Affiliation
[1] Tech Univ Carolo Wilhelmina Braunschweig, Inst Pharmaceut Chem, D-38106 Braunschweig, Germany
DOI
10.1021/ci700099u
Chinese Library Classification number
R914 [Medicinal Chemistry];
Discipline classification code
100701;
Abstract
A common finding of many reports evaluating ligand-based virtual screening methods is that validation results vary considerably with changing benchmark data sets. It is widely assumed that these data-set-specific effects are caused by the redundancy, self-similarity, and cluster structure inherent to those data sets. These phenomena manifest themselves in the data sets' representation in descriptor space, which is termed the data set topology. A methodology for the characterization of data set topology based on spatial statistics is introduced. The method is nonparametric and can deal with arbitrary distributions of descriptor values. With this methodology, it is possible to associate differences in virtual screening performance on different data sets with differences in data set topology. Moreover, the better virtual screening performance of certain descriptors can be explained by their ability to represent the benchmark data sets with a more favorable topology. Finally, it is shown that the composition of some benchmark data sets causes topologies that lead to overoptimistic validation results even in very "simple" descriptor spaces. Spatial statistics analysis as proposed here facilitates the detection of such biased data sets and may provide a tool for the future design of unbiased benchmark data sets.
Pages: 704-718
Page count: 15