PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions

Cited by: 744
Authors
Lin, Michael F. [1, 2]
Jungreis, Irwin [1, 2]
Kellis, Manolis [1, 2]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA
[2] Broad Inst, Cambridge, MA 02142 USA
Funding
US National Science Foundation
Keywords
LIKELIHOOD RATIO TESTS; SEQUENCE EVOLUTION; SUBSTITUTION; GENES; MODEL; DISTRIBUTIONS; REVEALS;
DOI
10.1093/bioinformatics/btr209
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010 ; 081704 ;
Abstract
Motivation: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models.

Results: We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds that of all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, which are often initially recognized by their lack of protein-coding potential rather than by conserved RNA secondary structures.
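To make the idea of "a formal statistical comparison" of a coding and a non-coding model concrete, the following is a minimal sketch of a log-likelihood ratio score on a pairwise codon alignment. It is not the PhyloCSF implementation: PhyloCSF fits phylogenetic codon substitution models over a multispecies tree, whereas this toy only sorts aligned codon pairs into four categories. The names `classify_pair` and `toy_coding_score` and every probability in `P_CODING` and `P_NONCODING` are hypothetical values chosen for illustration, not estimates from the paper.

```python
import math

# Standard genetic code, with codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}

# HYPOTHETICAL category probabilities (illustrative only, not from the paper).
# The "coding" model favors synonymous changes and almost never shows stops;
# the "non-coding" model is indifferent to the encoded amino acid.
P_CODING = {"identical": 0.70, "synonymous": 0.20,
            "nonsynonymous": 0.099, "stop": 0.001}
P_NONCODING = {"identical": 0.60, "synonymous": 0.08,
               "nonsynonymous": 0.27, "stop": 0.05}


def classify_pair(codon1, codon2):
    """Assign an aligned codon pair to one of four categories."""
    aa1, aa2 = CODON_TABLE[codon1], CODON_TABLE[codon2]
    if aa1 == "*" or aa2 == "*":
        return "stop"
    if codon1 == codon2:
        return "identical"
    return "synonymous" if aa1 == aa2 else "nonsynonymous"


def toy_coding_score(seq1, seq2):
    """Sum of per-codon log-likelihood ratios, coding versus non-coding.

    Both sequences must be ungapped, in frame, and of equal length;
    characters other than A, C, G, T raise KeyError.
    """
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by 3")
    score = 0.0
    for i in range(0, len(seq1), 3):
        category = classify_pair(seq1[i:i + 3].upper(), seq2[i:i + 3].upper())
        score += math.log(P_CODING[category] / P_NONCODING[category])
    return score
```

Under these assumed probabilities, a positive score means the alignment is better explained by the toy coding model and a negative score favors the non-coding model. For example, `toy_coding_score("ATGGCTGAA", "ATGGCCGAG")` is positive because both differences are synonymous, while an alignment containing in-frame stop codons scores negative.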
Pages: i275-i282
Number of pages: 8