Accelerating Clustering using Approximate Spanning Tree and Prime Number based Filter

被引:0
作者
Rao, Dhananjai M. [1 ]
Sreeskandarajan, Sutharzan [2 ]
Liang, Chun [2 ]
机构
[1] Miami Univ, CSE Dept, Oxford, OH 45056 USA
[2] Miami Univ, Dept Biol, Oxford, OH 45056 USA
来源
2019 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW) | 2019年
关键词
Clustering; Minimum Spanning Tree; d2; ALIGNMENT; VISUALIZATION; SEQUENCES; EVOLUTION; RESOURCE;
D O I
10.1109/IPDPSW.2019.00037
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Motivation: Clustering genomic data, including those generated via high-throughput sequencing, is an important preliminary step for assembly and analysis. However, clustering a large number of sequences is time-consuming. Methods: In this paper, we discuss algorithmic performance improvements to our existing clustering system called PEACE via the following two new approaches: (1) using Approximate Spanning Tree (AST) that is computed much faster than the currently used Minimum Spanning Tree (MST) approach, and (2) a novel Prime Numbers based Heuristic (PNH) for generating features and comparing them to further reduce comparison overheads. Results: Experiments conducted using a variety of data sets show that the proposed method significantly improves performance for datasets with large clusters with only minimal degradation in clustering quality. We also compare our methods against wcd-kaboom, a state-of-the-art clustering software. Our experiments show that with AST and PNH underperform wcd-kaboom for datasets that have many small clusters. However, they significantly outperform wcd-kaboom for datasets with large clusters by a conspicuous similar to 550xwith comparable clustering quality. The results indicate that the proposed methods hold considerable promise for accelerating clustering of genomic data with large clusters.
引用
收藏
页码:166 / 174
页数:9
相关论文
共 31 条
  • [1] Median-joining networks for inferring intraspecific phylogenies
    Bandelt, HJ
    Forster, P
    Röhl, A
    [J]. MOLECULAR BIOLOGY AND EVOLUTION, 1999, 16 (01) : 37 - 48
  • [2] Phylogeography and evolution of the Tanganyikan cichlid genus Tropheus based upon mitochondrial DNA sequences
    Baric, S
    Salzburger, W
    Sturmbauer, C
    [J]. JOURNAL OF MOLECULAR EVOLUTION, 2003, 56 (01) : 54 - 68
  • [3] A global initiative on sharing avian flu data
    Bogner, Peter
    Capua, Ilaria
    Cox, Nancy J.
    Lipman, David J.
    [J]. NATURE, 2006, 442 (7106) : 981 - 981
  • [4] Virus Variation Resource-recent updates and future directions
    Brister, J. Rodney
    Bao, Yiming
    Zhdanov, Sergey A.
    Ostapchuck, Yuri
    Chetvernin, Vyacheslav
    Kiryutin, Boris
    Zaslavsky, Leonid
    Kimelman, Michael
    Tatusova, Tatiana A.
    [J]. NUCLEIC ACIDS RESEARCH, 2014, 42 (D1) : D660 - D665
  • [5] Integrative clustering of multi-level 'omic data based on non-negative matrix factorization algorithm
    Chalise, Prabhakar
    Fridley, Brooke L.
    [J]. PLOS ONE, 2017, 12 (05):
  • [6] Bayesian nonparametric clustering in phylogenetics: modeling antigenic evolution in influenza
    Cybis, Gabriela B.
    Sinsheimer, Janet S.
    Bedford, Trevor
    Rambaut, Andrew
    Lemey, Philippe
    Suchard, Marc A.
    [J]. STATISTICS IN MEDICINE, 2018, 37 (02) : 195 - 206
  • [7] A novel representation of genomic sequences for taxonomic clustering and visualization by means of self-organizing maps
    Delgado, Soledad
    Moran, Federico
    Mora, Antonio
    Julian Merelo, Juan
    Briones, Carlos
    [J]. BIOINFORMATICS, 2015, 31 (05) : 736 - 744
  • [8] Domain identification by clustering sequence alignments
    Guan, XJ
    Du, L
    [J]. BIOINFORMATICS, 1998, 14 (09) : 783 - 788
  • [9] An overview of the wcd EST clustering tool
    Hazelhurst, Scott
    Hide, Winston
    Liptak, Zsuzsanna
    Nogueira, Ramon
    Starfield, Richard
    [J]. BIOINFORMATICS, 2008, 24 (13) : 1542 - 1546
  • [10] KABOOM! A new suffix array based algorithm for clustering expression data
    Hazelhurst, Scott
    Liptak, Zsuzsanna
    [J]. BIOINFORMATICS, 2011, 27 (24) : 3348 - 3355