A classification-based prediction model of messenger RNA polyadenylation sites

Cited by: 30
Authors
Ji, Guoli [1]
Wu, Xiaohui [1]
Shen, Yingjia [2]
Huang, Jiangyin [1]
Li, Qingshun Quinn [2]
Affiliations
[1] Xiamen Univ, Dept Automat, Xiamen 361000, Peoples R China
[2] Miami Univ, Dept Bot, Oxford, OH 45056 USA
Funding
National Science Foundation (USA); National Natural Science Foundation of China;
Keywords
Arabidopsis; Classification-based modeling; Genome annotation; Polyadenylation; Predictive modeling; AMINO-ACID-COMPOSITION; PROTEIN STRUCTURAL CLASSES; SUBCELLULAR LOCATION PREDICTION; SUPPORT VECTOR MACHINE; ALTERNATIVE POLYADENYLATION; CHLAMYDOMONAS-REINHARDTII; WEB-SERVER; RECOGNITION; SIGNALS; GENOME;
DOI
10.1016/j.jtbi.2010.05.015
Chinese Library Classification
Q [Biological Sciences];
Subject classification codes
07 ; 0710 ; 09 ;
Abstract
Messenger RNA polyadenylation is one of the essential processing steps during eukaryotic gene expression. The site of polyadenylation [poly(A) site] marks the end of a transcript, which is also the end of a gene. A computational program that is able to recognize poly(A) sites would prove useful not only for genome annotation in finding gene ends, but also for predicting alternative poly(A) sites. Features that define the poly(A) sites can now be extracted from the poly(A) site datasets to build such predictive models. Using methods including K-gram patterns, the Z-curve, position-specific scoring matrices and a first-order inhomogeneous Markov sub-model, numerous features were generated and placed in an original feature space. To select the most useful features, attribute selection algorithms, such as information gain and entropy, were employed. A training model was then built based on the Bayesian network to determine a subset of the optimal features. Test models corresponding to the training models were built to predict poly(A) sites in Arabidopsis and rice. Thus, a prediction model, termed Poly(A) site classifier, or PAC, was constructed. The uniqueness of the model lies in its structure, in that each sub-model can be replaced or expanded, while feature generation, selection and classification are all independent processes. Its modular design makes it easily adaptable to different species or datasets. The algorithm's high specificity and sensitivity were demonstrated by testing on several datasets and, at the best combinations, both reached 95%. The software package may be used for genome annotation and for optimizing transgene structure. (C) 2010 Elsevier Ltd. All rights reserved.
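As a rough illustration of two of the feature generators named in the abstract, the following minimal Python sketch counts overlapping K-grams and computes cumulative Z-curve coordinates for a sequence window. It is not the PAC implementation: the function names `kgram_counts` and `z_curve` are hypothetical, and the Z-curve here uses the standard cumulative definition, which may differ from the variant used in the paper.

```python
from collections import Counter


def kgram_counts(seq, k):
    """Count overlapping k-grams (substrings of length k) in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def z_curve(seq):
    """Return cumulative Z-curve coordinates (x, y, z) after each base.

    x: purine (A, G) minus pyrimidine (C, T)
    y: amino (A, C) minus keto (G, T)
    z: weak H-bond (A, T) minus strong H-bond (G, C)
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    # Treat RNA input as DNA by mapping U to T (an assumption of this sketch).
    for base in seq.upper().replace("U", "T"):
        if base in counts:  # ambiguous bases such as N leave the counts unchanged
            counts[base] += 1
        a, c, g, t = (counts[b] for b in "ACGT")
        points.append((a + g - c - t, a + c - g - t, a + t - g - c))
    return points
```

In a pipeline of the kind the abstract describes, vectors such as these would be placed in the feature space and then filtered by an attribute selection step before classification.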
Pages: 287-296
Number of pages: 10