Approaches to the automatic discovery of patterns in biosequences

Cited by: 143
Authors
Brazma, A
Jonassen, I [1]
Eidhammer, I
Gilbert, D
Affiliations
[1] Univ Bergen, Dept Informat, HIB, N-5020 Bergen, Norway
[2] European Bioinformat Inst, EMBL Outstn, Cambridge CB10 1SD, England
[3] City Univ London, Dept Comp Sci, London EC1V 0HB, England
Keywords
automatic discovery; bioinformatics; biosequences; machine learning; patterns
DOI
10.1089/cmb.1998.5.279
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
This paper surveys approaches to the discovery of patterns in biosequences and places these approaches within a formal framework that systematises the types of patterns and the discovery algorithms. Patterns with expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering the patterns that are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis is presented of the ways in which an assessment can be made of the significance of the discovered patterns. It is shown that the problem is related to problems studied in the field of machine learning. The major part of this paper comprises a review of a number of existing methods developed to solve the problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examples are given of patterns that have been discovered using the different methods.
Pages: 279-305
Page count: 27