Counting clusters using R-NN curves

Cited by: 3
Authors
Guha, Rajarshi [1]
Dutta, Debojyoti
Wild, David J.
Chen, Ting
Affiliations
[1] Indiana Univ, Sch Informat, Bloomington, IN 47406 USA
[2] Univ So Calif, Dept Computat Biol, Los Angeles, CA 90089 USA
Keywords
AQUEOUS SOLUBILITY; ORGANIC-COMPOUNDS; CONFORMATIONS; EXPLORATION; FINGERPRINT; ALGORITHM; DATABASES
DOI
10.1021/ci600541f
Chinese Library Classification number
R914 [Medicinal chemistry]
Discipline classification code
100701
Abstract
Clustering is a common task in the field of cheminformatics. A key parameter that needs to be set for nonhierarchical clustering methods, such as k-means, is the number of clusters, k. Traditionally, the value of k is obtained by performing the clustering with different values of k and selecting the value that leads to the optimal clustering. In this study, we describe an approach to selecting k, a priori, based on the R-NN curve algorithm described by Guha et al. (J. Chem. Inf. Model., 2006, 46, 1713-1722), which uses a nearest-neighbor technique to characterize the spatial location of compounds in arbitrary descriptor spaces. The algorithm generates a set of curves for the data set, which are then analyzed to estimate the natural number of clusters. We then performed k-means clustering with the predicted value of k, as well as with similar values, to check that the correct number of clusters was obtained. In addition, we compared the predicted value to the number indicated by the average silhouette width as a cluster quality measure. We tested the algorithm on simulated data as well as on two chemical data sets. Our results indicate that the R-NN curve algorithm is able to determine the natural number of clusters and is in general agreement with the average silhouette width in identifying the optimal number of clusters.
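The abstract does not spell out how the curves are built, so the following is only a minimal sketch under a stated assumption: that an R-NN curve records, for each compound, the fraction of the rest of the data set lying within radius R, as R grows in equal steps up to the maximum pairwise distance. The function name `rnn_curves`, the `n_steps` parameter, and the toy 2D points are hypothetical, and the step that analyzes the curves to estimate the number of clusters is not reproduced here because the abstract gives no details about it.

```python
import math

def rnn_curves(points, n_steps=100):
    """Return one curve per point: the fraction of the other points that lie
    within radius R, for R = 1/n_steps, ..., n_steps/n_steps of the maximum
    pairwise distance. (Assumed definition, see the lead-in.)"""
    n = len(points)
    # Full pairwise Euclidean distance matrix (fine for small illustrative sets).
    dist = [[math.dist(p, q) for q in points] for p in points]
    d_max = max(max(row) for row in dist)
    curves = []
    for i in range(n):
        curve = []
        for step in range(1, n_steps + 1):
            radius = d_max * (step / n_steps)
            count = sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
            curve.append(count / (n - 1))
        curves.append(curve)
    return curves

# Toy data: two tight, well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
curves = rnn_curves(points, n_steps=10)
for point, curve in zip(points, curves):
    print(point, curve)
```

On this toy set each curve rises quickly to the fraction of points in the compound's own group, stays flat across the gap between the groups, and reaches 1.0 only at the largest radius. Plateaus of this kind are the sort of feature one would expect a cluster-counting analysis to exploit, although the actual procedure used in the paper may differ.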
Pages: 1308-1318
Number of pages: 11