A modification of the Jaccard-Tanimoto similarity index for diverse selection of chemical compounds using binary strings

Times cited: 95
Authors
Fligner, MA [1]
Verducci, JS
Blower, PE
Affiliations
[1] Ohio State Univ, Dept Stat, Columbus, OH 43210 USA
[2] LeadScope Inc, Columbus, OH 43212 USA
Keywords
binary data; chemical fingerprints; data mining; measures of association; optimal design
DOI
10.1198/004017002317375064
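Chinese Library Classification (CLC) codes
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714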
Abstract
Determination of molecular similarity plays an important role in analyzing large compound databases in chemical and pharmaceutical research. When molecules are described by binary vectors with bits corresponding to the presence or absence of structural features, the Tanimoto association coefficient is the most commonly used measure of similarity or chemical distance between two compounds. However, when used to select compounds for an optimal spread design, the Tanimoto coefficient produces an intrinsic bias toward smaller compounds. We have developed a new association coefficient that overcomes this bias. This article gives details of the new coefficient and contrasts the two coefficients for selecting diverse sets of compounds from a large collection. When the Tanimoto coefficient is modified as suggested to select a diverse set in the National Cancer Institute and Registry of Toxic Effects of Chemical Substances databases, the average number of features among the selected compounds increases by more than 50%.
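As an illustration of the size effect described in the abstract, the following minimal sketch computes the standard Tanimoto coefficient for binary fingerprints. It is not the modified coefficient proposed in the article, which this record does not spell out. The function name `tanimoto`, the 8-bit example fingerprints, and the convention of returning 1.0 for two all-zero vectors are assumptions made here for illustration only.

```python
def tanimoto(a, b):
    """Standard Tanimoto (Jaccard) similarity of two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # features present in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # features present in at least one
    # Assumed convention: two all-zero fingerprints are treated as identical.
    return both / either if either else 1.0


# Hypothetical 8-bit fingerprints. Each pair differs in exactly two bit positions.
small_a = [1, 1, 0, 0, 0, 0, 0, 0]   # "small" compounds: few features set
small_b = [1, 0, 1, 0, 0, 0, 0, 0]
large_a = [1, 1, 1, 1, 1, 1, 1, 0]   # "large" compounds: many features set
large_b = [1, 1, 1, 1, 1, 1, 0, 1]

sim_small = tanimoto(small_a, small_b)   # 1 shared / 3 present
sim_large = tanimoto(large_a, large_b)   # 6 shared / 8 present

print(f"small pair: similarity {sim_small:.3f}, distance {1 - sim_small:.3f}")
print(f"large pair: similarity {sim_large:.3f}, distance {1 - sim_large:.3f}")
```

Although both pairs differ in the same number of bits, the pair with few features ends up much farther apart under the distance 1 - similarity. A selection procedure that maximizes spread under this distance would therefore tend to favor compounds with few features, which is consistent with the bias the abstract describes.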
Pages: 110-119
Number of pages: 10