Database indexing for large DNA and protein sequence collections

被引:30
作者
Hunt, E [1 ]
Atkinson, MP [1 ]
Irving, RW [1 ]
机构
[1] Univ Glasgow, Dept Comp Sci, Glasgow G12 8QQ, Lanark, Scotland
关键词
database index; suffix tree; approximate matching; biological sequence;
D O I
10.1007/s007780200064
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Our aim is to develop new database technologies for the approximate matching of unstructured string data using indexes. We explore the potential of the suffix tree data structure in this context. We present a new method of building suffix trees, allowing us to build trees in excess of RAM size, which has hitherto not been possible. We show that this method performs in practice as well as the O(n) method of Ukkonen [70]. Using this method we build indexes for 200 Mb of protein and 300 Mbp of DNA, whose disk-image exceeds the available RAM. We show experimentally that suffix trees can be effectively used in approximate string matching with biological data. For a range of query lengths and error bounds the suffix tree reduces the size of the unoptimised O(mn) dynamic programming calculation required in the evaluation of string similarity, and the gain from indexing increases with index size. In the indexes we built this reduction is significant, and less than 0.3% of the expected matrix is evaluated. We detail the requirements for further database and algorithmic research to support efficient use of large suffix indexes in biological applications.
引用
收藏
页码:256 / 271
页数:16
相关论文
共 75 条
[1]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[2]  
ALTSCHUL SF, 1990, J MOL BIOL, V215, P403, DOI 10.1006/jmbi.1990.9999
[3]   Suffix trees on words [J].
Andersson, A ;
Larsson, NJ ;
Swanson, K .
ALGORITHMICA, 1999, 23 (03) :246-260
[4]   EFFICIENT IMPLEMENTATION OF SUFFIX TREES [J].
ANDERSSON, A ;
NILSSON, S .
SOFTWARE-PRACTICE & EXPERIENCE, 1995, 25 (02) :129-141
[5]  
[Anonymous], LECT NOTES COMPUTER
[6]  
[Anonymous], RECOMB 99 P 3 ANN IN
[7]  
Atkinson M, 1999, LECT NOTES COMPUT SC, V1540, P1
[8]  
Atkinson M. P., 1996, SIGMOD Record, V25, P68, DOI 10.1145/245882.245905
[9]   Faster approximate string matching [J].
BaezaYates, R ;
Navarro, G .
ALGORITHMICA, 1999, 23 (02) :127-158
[10]   A NEW APPROACH TO TEXT SEARCHING [J].
BAEZAYATES, R ;
GONNET, GH .
COMMUNICATIONS OF THE ACM, 1992, 35 (10) :74-82