Words in DNA sequences: some case studies based on their frequency statistics

Cited by: 9
Authors
Basu, S
Burma, DP
Chaudhuri, P
Affiliations
[1] Indian Stat Inst, Theoret Stat & Math Unit, Kolkata 700108, W Bengal, India
[2] Banaras Hindu Univ, Inst Med Sci, Mol Biol Unit, Varanasi 221005, Uttar Pradesh, India
Keywords
average linkage clustering; Chernoff's faces; dendrograms; DNA words; F-ranks of words; F-ratios of words; l1-distance; phylogenetic relationships; rank correlation; single linkage clustering
DOI
10.1007/s00285-002-0185-3
Chinese Library Classification
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
One of the critical requirements of data analysis involving large DNA sequences is an effective statistical summarization of those sequences. In this article, DNA sequences are analyzed on the basis of word frequencies. Our analysis focuses on detecting the structural signature of a genome reflected in word frequencies and on identifying phylogenetic relationships among different species reflected in the variation of word distributions in their DNA sequences. We have carried out a statistical study of the complete genome of baker's yeast, of various ribosomal RNA sequences from different prokaryotic and eukaryotic organisms, and of the full genomes of some bacteriophages. Our exploratory analysis demonstrates the usefulness of DNA word frequencies in reducing the dimensionality of large sequences while retaining some of their structural information that can have biological significance. Some conceptual issues that arise in the course of our investigation are addressed. A few interesting problems related to the statistics of DNA words are pointed out, with some indication of their possible solutions. The work has been partially motivated by the fact that sequence alignment and homology techniques, which are quite popular for comparing and analyzing relatively small DNA sequences of nearly equal sizes, are not applicable to data consisting of large sequences with widely varying sizes. Such sequences may contain segments with unknown or no biological function, so their comparison through functional homology is either impossible or extremely difficult.
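The idea of summarizing sequences by word frequencies and comparing them by an l1-distance (both named in the keywords) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `word_frequencies` and `l1_distance`, the use of overlapping words, the restriction to the alphabet A, C, G, T, and the choice of word length `k` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def word_frequencies(seq, k):
    """Relative frequencies of overlapping k-letter words over A, C, G, T.

    Assumption: words containing any other symbol (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words)
    if total == 0:
        raise ValueError("sequence contains no valid words of length k")
    # Every one of the 4**k possible words gets an entry, possibly 0.0.
    return {w: counts[w] / total for w in words}


def l1_distance(f, g):
    """Sum of absolute differences between two frequency vectors."""
    return sum(abs(f[w] - g[w]) for w in f)


# Hypothetical toy sequences of very different lengths.
short_seq = "ACGT" * 2
long_seq = "ACGT" * 50
d = l1_distance(word_frequencies(short_seq, 1), word_frequencies(long_seq, 1))
```

Because each sequence is reduced to a vector of 4**k relative frequencies, sequences of widely varying sizes become directly comparable without alignment, which is the motivation stated in the abstract. A matrix of such pairwise distances could then be passed to a hierarchical clustering routine to draw a dendrogram.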
Pages: 479-503
Number of pages: 25