SketchSort: Fast All Pairs Similarity Search for Large Databases of Molecular Fingerprints

被引:14
作者
Tabei, Yasuo [1 ]
Tsuda, Koji [1 ,2 ]
机构
[1] Japan Sci & Technol Agcy, ERATO, Minato Discrete Struct Manipulat Syst Project, Sapporo, Hokkaido 0600814, Japan
[2] Natl Inst Adv Ind Sci & Technol, Computat Biol Res Ctr, Tokyo 1350064, Japan
关键词
All pairs similarity search; SketchSort; Multiple sorting; ALGORITHMS;
D O I
10.1002/minf.201100050
中图分类号
R914 [药物化学];
学科分类号
100701 ;
摘要
Similarity networks of ligands are often reported useful in predicting chemical activities and target proteins. However, the naive method of computing all pairwise similarities of chemical fingerprints takes quadratic time, which is prohibitive for large scale databases with millions of ligands. We propose a fast all pairs similarity search method, called SketchSort, that maps chemical fingerprints to symbol strings with random projections, and finds similar strings by multiple masked sorting. Due to random projection, SketchSort misses a certain fraction of neighbors (i.e., false negatives). Nevertheless, the expected fraction of false negatives is theoretically derived and can be kept under a very small value. Experiments show that SketchSort is much faster than other similarity search methods and enables us to obtain a PubChem-scale similarity network quickly.
引用
收藏
页码:801 / 807
页数:7
相关论文
共 18 条
  • [1] GENERALIZED STRING MATCHING
    ABRAHAMSON, K
    [J]. SIAM JOURNAL ON COMPUTING, 1987, 16 (06) : 1039 - 1051
  • [2] [Anonymous], 2007, An introduction to chemoinformatics
  • [3] AUNG Z, 2010, P 22 INT C SCI STAT, P288
  • [4] Speeding up chemical database searches using a proximity filter based on the logical exclusive OR
    Baldi, Pierre
    Hirschberg, Daniel S.
    Nasr, Ramzi J.
    [J]. JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2008, 48 (07) : 1367 - 1378
  • [5] CLUSTERING OF CHEMICAL STRUCTURES ON THE BASIS OF 2-DIMENSIONAL SIMILARITY MEASURES
    BARNARD, JM
    DOWNS, GM
    [J]. JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 1992, 32 (06): : 644 - 649
  • [6] Min-wise independent permutations
    Broder, AZ
    Charikar, M
    Frieze, AM
    Mitzenmacher, M
    [J]. JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 2000, 60 (03) : 630 - 659
  • [7] Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing
    Cao, Yiqun
    Jiang, Tao
    Girke, Thomas
    [J]. BIOINFORMATICS, 2010, 26 (07) : 953 - 959
  • [8] Clustering methods and their uses in computational chemistry
    Downs, GM
    Barnard, JM
    [J]. REVIEWS IN COMPUTATIONAL CHEMISTRY, VOL 18, 2002, 18 : 1 - 40
  • [9] Relating protein pharmacology by ligand chemistry
    Keiser, Michael J.
    Roth, Bryan L.
    Armbruster, Blaine N.
    Ernsberger, Paul
    Irwin, John J.
    Shoichet, Brian K.
    [J]. NATURE BIOTECHNOLOGY, 2007, 25 (02) : 197 - 206
  • [10] Chemical space
    Kirkpatrick, P
    Ellis, C
    [J]. NATURE, 2004, 432 (7019) : 823 - 823