An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences

Cited by: 19
Authors
Ye, Kai [1]
Kosters, Walter A.
Ijzerman, Adriaan P.
Affiliations
[1] Ctr Drug Res, Div Med Chem, Leiden, Netherlands
[2] Leiden Univ, Inst Adv Comp Sci, Leiden, Netherlands
DOI
10.1093/bioinformatics/btl665
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Subject classification codes
071010; 081704;
Abstract
Motivation: Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http://www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/), are used to identify frequent patterns directly from unaligned biological sequences without attempting to align them. Here we propose a new algorithm with more efficiency and more functionality than both PRATT2 and TEIRESIAS, and discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets.
Results: In this study, we designed and implemented six algorithms to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns but has additional functionality, such as mining the so-called type II and type III patterns and finding discriminating patterns between two datasets.
Pages: 687-693
Page count: 7