FINITE-STATE MODELS IN THE ALIGNMENT OF MACROMOLECULES

被引:43
作者
ALLISON, L
WALLACE, CS
YEE, CN
机构
[1] Department of Computer Science, Monash University
关键词
ALIGNMENT; EDIT DISTANCE; HOMOLOGY; INDUCTIVE INFERENCE; MINIMUM MESSAGE LENGTH; SIMILARITY; STRING;
D O I
10.1007/BF00160262
中图分类号
Q5 [生物化学]; Q7 [分子生物学];
学科分类号
071010 ; 081704 ;
摘要
Minimum message length encoding is a technique of inductive inference with theoretical and practical advantages. It allows the posterior odds-ratio of two theories or hypotheses to be calculated. Here it is applied to problems of aligning or relating two strings, in particular two biological macromolecules. We compare the r-theory, that the strings are related, with the null-theory, that they are not related. If they are related, the probabilities of the various alignments can be calculated. This is done for one-, three-, and five-state models of relation or mutation. These correspond to linear and piecewise linear cost functions on runs of insertions and deletions. We describe how to estimate parameters of a model. The validity of a model is itself an hypothesis and can be objectively tested. This is done on real DNA strings and on artificial data. The tests on artificial data indicate limits on what can be inferred in various situations. The tests on real DNA support either the three- or five-state models over the one-state model. Finally, a fast, approximate minimum message length string comparison algorithm is described.
引用
收藏
页码:77 / 89
页数:13
相关论文
共 40 条
[1]   MINIMUM MESSAGE LENGTH ENCODING AND THE COMPARISON OF MACROMOLECULES [J].
ALLISON, L ;
YEE, CN .
BULLETIN OF MATHEMATICAL BIOLOGY, 1990, 52 (03) :431-453
[2]  
ALLISON L, 1992, IN PRESS HAWAII INT
[3]  
ALLISON L, 1990, P ARTIFICIAL INTELLI
[4]   THE MULTIPLE ORIGINS OF HUMAN ALU SEQUENCES [J].
BAINS, W .
JOURNAL OF MOLECULAR EVOLUTION, 1986, 23 (03) :189-199
[5]  
Bishop M.J., 1986, P150
[6]  
BISHOP MJ, 1987, NUCLEIC ACID PROTEIN, P359
[7]   INFORMATION MEASURE FOR HIERARCHICAL CLASSIFICATION [J].
BOULTON, DM ;
WALLACE, CS .
COMPUTER JOURNAL, 1973, 16 (03) :254-261
[8]   INFORMATION CONTENT OF A MULTISTATE DISTRIBUTION [J].
BOULTON, DM ;
WALLACE, CS .
JOURNAL OF THEORETICAL BIOLOGY, 1969, 23 (02) :269-+
[9]   ON LENGTH OF PROGRAMS FOR COMPUTING FINITE BINARY SEQUENCES [J].
CHAITIN, GJ .
JOURNAL OF THE ACM, 1966, 13 (04) :547-+
[10]  
COHEN D N, 1975, Mathematical Biosciences, V24, P25, DOI 10.1016/0025-5564(75)90064-4