Counting clusters using R-NN curves

Cited by: 3
Authors
Guha, Rajarshi [1]
Dutta, Debojyoti
Wild, David J.
Chen, Ting
Affiliations
[1] Indiana Univ, Sch Informat, Bloomington, IN 47406 USA
[2] Univ So Calif, Dept Computat Biol, Los Angeles, CA 90089 USA
Keywords
AQUEOUS SOLUBILITY; ORGANIC-COMPOUNDS; CONFORMATIONS; EXPLORATION; FINGERPRINT; ALGORITHM; DATABASES
DOI
10.1021/ci600541f
Chinese Library Classification number
R914 [Medicinal Chemistry]
Subject classification code
100701
Abstract
Clustering is a common task in the field of cheminformatics. A key parameter that needs to be set for nonhierarchical clustering methods, such as k-means, is the number of clusters, k. Traditionally, the value of k is obtained by performing the clustering with different values of k and selecting the value that leads to the optimal clustering. In this study, we describe an approach to selecting k, a priori, based on the R-NN curve algorithm described by Guha et al. (J. Chem. Inf. Model., 2006, 46, 1713-1722), which uses a nearest-neighbor technique to characterize the spatial location of compounds in arbitrary descriptor spaces. The algorithm generates a set of curves for the data set, which are then analyzed to estimate the natural number of clusters. We then performed k-means clustering with the predicted value of k, as well as with similar values, to check that the correct number of clusters was obtained. In addition, we compared the predicted value to the number indicated by the average silhouette width as a cluster quality measure. We tested the algorithm on simulated data as well as on two chemical data sets. Our results indicate that the R-NN curve algorithm is able to determine the natural number of clusters and is in general agreement with the average silhouette width in identifying the optimal number of clusters.
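The traditional procedure mentioned in the abstract (cluster with several values of k, then keep the k with the best average silhouette width) can be sketched as follows. This is a minimal illustration, not the R-NN curve algorithm and not the authors' code: the function names (`kmeans`, `average_silhouette_width`, `pick_k`), the farthest-point initialisation, and the simulated three-blob data set are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, iters=100):
    """Lloyd's k-means; returns one cluster label per point.

    Assumption: deterministic farthest-point initialisation, chosen here
    only to keep the sketch reproducible.
    """
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from its nearest existing centre.
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster empties
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def average_silhouette_width(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def pick_k(points, candidates):
    """Cluster once per candidate k and return the best-scoring k."""
    scores = {
        k: average_silhouette_width(points, kmeans(points, k)) for k in candidates
    }
    return max(scores, key=scores.get), scores


# Simulated data (an assumption for this example): three well-separated blobs.
rng = random.Random(42)
blob_centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
    for cx, cy in blob_centres
    for _ in range(30)
]
best_k, scores = pick_k(points, range(2, 7))
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

The sketch makes the cost of the traditional approach visible: every candidate k requires a full clustering run plus a quadratic-time silhouette computation, which is the overhead an a priori estimate of k aims to avoid.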
Pages: 1308-1318
Page count: 11