A Graph-Theoretical Approach for Motif Discovery in Protein Sequences

被引:5
作者
Czeizler, Elena [1 ]
Hirvola, Tommi [2 ]
Karhu, Kalle [2 ]
机构
[1] Aalto Univ, Dept Comp Sci, HIIT, Espoo, Finland
[2] Aalto Univ, Dept Comp Sci, Espoo, Finland
关键词
Motif recognition; protein motifs; gapped motifs; protein sequences; bioinformatics; SUBSTITUTION MATRICES; DATABASE; SITES;
D O I
10.1109/TCBB.2015.2511750
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motif recognition is a challenging problem in bioinformatics due to the diversity of protein motifs. Many existing algorithms identify motifs of a given length, thus being either not applicable or not efficient when searching simultaneously for motifs of various lengths. Searching for gapped motifs, although very important, is a highly time-consuming task due to the combinatorial explosion of possible combinations implied by the consideration of long gaps. We introduce a new graph theoretical approach to identify motifs of various lengths, both with and without gaps. We compare our approach with two widely used methods: MEME and GLAM2 analyzing both the quality of the results and the required computational time. Our method provides results of a slightly higher level of quality than MEME but at a much faster rate, i.e., one eighth of MEME's query time. By using similarity indexing, we drop the query times down to an average of approximately one sixth of the ones required by GLAM2, while achieving a slightly higher level of quality of the results. More precisely, for sequence collections smaller than 50,000 bytes GLAM2 is 13 times slower, while being at least as fast as our method on larger ones. The source code of our C++ implementation is freely available in GitHub: https://github.com/hirvolt1/debruijn-motif.
引用
收藏
页码:121 / 130
页数:10
相关论文
共 28 条
  • [1] Bailey T L, 1994, Proc Int Conf Intell Syst Mol Biol, V2, P28
  • [2] BAILEY TL, 1995, MACH LEARN, V21, P51, DOI 10.1007/BF00993379
  • [3] Baussand J, 2008, EVOL BIOINFORM, V4, P255
  • [4] Approaches to the automatic discovery of patterns in biosequences
    Brazma, A
    Jonassen, I
    Eidhammer, I
    Gilbert, D
    [J]. JOURNAL OF COMPUTATIONAL BIOLOGY, 1998, 5 (02) : 279 - 305
  • [5] Buhler Jeremy., 2001, J COMPUT BIOL, P69
  • [6] A survey of DNA motif finding algorithms
    Das, Modan K.
    Dai, Ho-Kwok
    [J]. BMC BIOINFORMATICS, 2007, 8 (Suppl 7)
  • [7] SIMPLIFIED STATISTICS FOR SMALL NUMBERS OF OBSERVATIONS
    DEAN, RB
    DIXON, WJ
    [J]. ANALYTICAL CHEMISTRY, 1951, 23 (04) : 636 - 638
  • [8] SLiMFinder: A Probabilistic Method for Identifying Over-Represented, Convergently Evolved, Short Linear Motifs in Proteins
    Edwards, Richard J.
    Davey, Norman E.
    Shields, Denis C.
    [J]. PLOS ONE, 2007, 2 (10):
  • [9] Evaluating deterministic motif significance measures in protein databases
    Ferreira, Pedro Gabriel
    Azevedo, Paulo J.
    [J]. ALGORITHMS FOR MOLECULAR BIOLOGY, 2007, 2
  • [10] MotifCut: regulatory motifs finding with maximum density subgraphs
    Fratkin, Eugene
    Naughton, Brian T.
    Brutlag, Douglas L.
    Batzoglou, Serafim
    [J]. BIOINFORMATICS, 2006, 22 (14) : E150 - E157