Into the heart of darkness: large-scale clustering of human non-coding DNA

Cited by: 51
Authors
Bejerano, Gill [1]
Haussler, David [1]
Blanchette, Mathieu [2]
Affiliations
[1] Univ Calif Santa Cruz, Baskin Sch Engn, Ctr Biomol Sci & Engn, Santa Cruz, CA 95064 USA
[2] 3775 Univ, McGill Ctr Bioinformat, Sch Comp Sci, Montreal, PQ H3A 2B4, Canada
DOI
10.1093/bioinformatics/bth946
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: It is currently believed that the human genome contains about twice as much non-coding functional regions as it does protein-coding genes, yet our understanding of these regions is very limited.
Results: We examine the intersection between syntenically conserved sequences in the human, mouse and rat genomes, and sequence similarities within the human genome itself, in search of families of non-protein-coding elements. For this purpose we develop a graph theoretic clustering algorithm, akin to the highly successful methods used in elucidating protein sequence family relationships. The algorithm is applied to a highly filtered set of about 700 000 human-rodent evolutionarily conserved regions, not resembling any known coding sequence, which encompasses 3.7% of the human genome. From these, we obtain roughly 12 000 non-singleton clusters, dense in significant sequence similarities. Further analysis of genomic location, evidence of transcription and RNA secondary structure reveals many clusters to be significantly homogeneous in one or more characteristics. This subset of the highly conserved non-protein-coding elements in the human genome thus contains rich family-like structures, which merit in-depth analysis.
Pages: 40-48
Page count: 9