Shared Nearest Neighbor Clustering in a Locality Sensitive Hashing Framework

被引:5
|
作者
Kanj, Sawsan [1 ,2 ,3 ,4 ,5 ]
Bruls, Thomas [1 ,3 ,4 ,5 ]
Gazut, Stephane [2 ]
机构
[1] CEA, Genoscope, 2 Rue Gaston Cremieux, F-91057 Evry, France
[2] CEA, LIST, Lab Anal Donnees & Intelligence Syst, Gif Sur Yvette, France
[3] Univ Evry, Evry, France
[4] CNRS, UMR 8030, Evry, France
[5] Univ Paris Saclay, Evry, France
关键词
density-based methods; metagenomic data; sequence clustering; locality sensitive hashing;
D O I
10.1089/cmb.2017.0113
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
We present a new algorithm to cluster high-dimensional sequence data and its application to the field of metagenomics, which aims at reconstructing individual genomes from a mixture of genomes sampled from an environmental site, without any prior knowledge of reference data (genomes) or the shape of clusters. Such problems typically cannot be solved directly with classical approaches seeking to estimate the density of clusters, for example, using the shared nearest neighbors (SNN) rule, due to the prohibitive size of contemporary sequence datasets. We explore here a new approach based on combining the SNN rule with the concept of locality sensitive hashing (LSH). The proposed method, called LSH-SNN, works by randomly splitting the input data into smaller-sized subsets (buckets) and employing the SNN rule on each of these buckets. Links can be created among neighbors sharing a sufficient number of elements, hence allowing clusters to be grown from linked elements. LSH-SNN can scale up to larger datasets consisting of millions of sequences, while achieving high accuracy across a variety of sample sizes and complexities.
引用
收藏
页码:236 / 250
页数:15
相关论文
共 50 条
  • [1] Secure Approximate Nearest Neighbor Search with Locality-Sensitive Hashing
    Song, Shang
    Liu, Lin
    Chen, Rongmao
    Peng, Wei
    Wang, Yi
    COMPUTER SECURITY - ESORICS 2023, PT III, 2024, 14346 : 411 - 430
  • [2] Theoretical Analysis on Pruning Nearest Neighbor Candidates by Locality Sensitive Hashing
    Mutoh, Tomoyuki
    Iwamura, Masakazu
    Kise, Koichi
    TENCON 2010: 2010 IEEE REGION 10 CONFERENCE, 2010, : 1027 - 1030
  • [3] Improved locality-sensitive hashing method for the approximate nearest neighbor problem
    Lu Ying-Hua
    Ma Ting-Huai
    Zhong Shui-Ming
    Cao Jie
    Wang Xin
    Al-Dhelaan, Abdullah
    CHINESE PHYSICS B, 2014, 23 (08)
  • [4] Improved locality-sensitive hashing method for the approximate nearest neighbor problem
    陆颖华
    马廷淮
    钟水明
    曹杰
    王新
    Abdullah Al-Dhelaane
    Chinese Physics B, 2014, (08) : 221 - 229
  • [5] Query-Aware Locality-Sensitive Hashing for Approximate Nearest Neighbor Search
    Huang, Qiang
    Feng, Jianlin
    Zhang, Yikai
    Fang, Qiong
    Ng, Wilfred
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2015, 9 (01): : 1 - 12
  • [6] Bi-level Locality Sensitive Hashing for k-Nearest Neighbor Computation
    Pan, Jia
    Manocha, Dinesh
    2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2012, : 378 - 389
  • [7] Fuzzy Shared Nearest Neighbor Clustering
    Rika Sharma
    Kesari Verma
    International Journal of Fuzzy Systems, 2019, 21 : 2667 - 2678
  • [8] Fuzzy Shared Nearest Neighbor Clustering
    Sharma, Rika
    Verma, Kesari
    INTERNATIONAL JOURNAL OF FUZZY SYSTEMS, 2019, 21 (08) : 2667 - 2678
  • [9] Locality-sensitive hashing for finding nearest neighbors
    Slaney, Malcolm
    Casey, Michael
    IEEE SIGNAL PROCESSING MAGAZINE, 2008, 25 (02) : 128 - 131
  • [10] Experimental Analysis of Locality Sensitive Hashing Techniques for High-Dimensional Approximate Nearest Neighbor Searches
    Jafari, Omid
    Nagarkar, Parth
    DATABASES THEORY AND APPLICATIONS (ADC 2021), 2021, 12610 : 62 - 73