Engineering a fast online persistent suffix tree construction

被引:11
作者
Bedathur, SJ [1 ]
Haritsa, JR [1 ]
机构
[1] Indian Inst Sci, SERC, Database Syst Lab, Bangalore 560012, Karnataka, India
来源
20TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS | 2004年
关键词
D O I
10.1109/ICDE.2004.1320040
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Online persistent suffix tree construction has been considered impractical due to its excessive 110 costs. However these prior studies have not taken into account the effects of the buffer management policy and the internal node structure of the suffix tree on I/O behavior o construction and subsequent retrievals over the tree. In this paper we study these two issues in detail in the context of large genomic DNA and Protein sequences. In particular we make the following contributions: (i) a novel, low-overhead buffering policy called TOP-Q which improves the on-disk behavior of suffix tree construction and subsequent retrievals, and (ii) empirical evidence that the space efficient linked-list representation of suffix tree nodes provides significantly inferior performance when compared to the array representation. These results demonstrate that a careful choice of implementation strategies can make online persistent suffix tree construction considerably more scalable - in terms of length of sequences indexed with a fixed memory budget, than currently perceived.
引用
收藏
页码:720 / 731
页数:12
相关论文
共 22 条
  • [1] [Anonymous], 1997, ALGORITHMS STRINGS T, DOI DOI 10.1017/CBO9780511574931
  • [2] BIEGANSKI P, 1995, THESIS U MINNESOTA
  • [3] CHANG WI, 1990, P IEEE ANN S FDN COM
  • [4] Chou H., 1985, P 11 VLDB C
  • [5] EFFELSBERG W, 1984, ACM T DATABASE SYSTE, V9
  • [6] FARACH M, 1998, P IEEE ANN S FDN COM
  • [7] FARACHCOLTON M, 1997, P IEEE ANN S FDN COM
  • [8] GIEGERICH R, 1995, SCI PROGRAMMING
  • [9] GUSFIELD D, 2002, IEEE BIOINF C CSB
  • [10] HUNT E, 2000, P ECOOP