Recognition models to predict DNA-binding specificities of homeodomain proteins

Cited by: 34
Authors
Christensen, Ryan G. [1]
Enuameh, Metewo Selase [2]
Noyes, Marcus B. [2,3]
Brodsky, Michael H. [2,4]
Wolfe, Scot A. [2,3]
Stormo, Gary D. [1]
Affiliations
[1] Washington Univ, Sch Med, Dept Genet, St Louis, MO 63108 USA
[2] Univ Massachusetts, Sch Med, Program Gene Funct & Express, Worcester, MA 01605 USA
[3] Univ Massachusetts, Sch Med, Dept Biochem & Mol Pharmacol, Worcester, MA 01605 USA
[4] Univ Massachusetts, Sch Med, Dept Mol Med, Worcester, MA 01605 USA
Funding
National Institutes of Health (USA)
Keywords
ENGRAILED HOMEODOMAIN; ZINC FINGERS; CRYSTAL-STRUCTURE; CODE; RESOLUTION; COMPLEX; SITES; INTERFACES; SELECTION; REVEALS;
DOI
10.1093/bioinformatics/bts202
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Recognition models for protein-DNA interactions, which allow the prediction of specificity for a DNA-binding domain based only on its sequence or the alteration of specificity through rational design, have long been a goal of computational biology. There has been some progress in constructing useful models, especially for C2H2 zinc finger proteins, but it remains a challenging problem with ample room for improvement. For most families of transcription factors, the best available methods use k-nearest neighbor (KNN) algorithms, which predict specificity as the average of the specificities of the k most similar proteins with defined specificities. Homeodomain (HD) proteins are the second most abundant family of transcription factors, after zinc fingers, in most metazoan genomes, so an effective recognition model for this family would facilitate predictive models of many transcriptional regulatory networks within these genomes.

Results: Using extensive experimental data, we have tested several machine learning approaches and find that both support vector machines and random forests (RFs) can produce recognition models for HD proteins that are significant improvements over KNN-based methods. Cross-validation analyses show that the resulting models are capable of predicting specificities with high accuracy. We have produced a web-based prediction tool, PreMoTF (Predicted Motifs for Transcription Factors) (http://stormo.wustl.edu/PreMoTF), for predicting position frequency matrices from protein sequence using an RF-based model.
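The KNN baseline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `identity` and `knn_predict_pfm`, the similarity measure (fraction of identical residues between pre-aligned sequences), and the toy four-residue "sequences" and matrices are all assumptions made for illustration.

```python
from typing import Dict, List

# A position frequency matrix: one row per motif position, columns A, C, G, T.
PFM = List[List[float]]


def identity(a: str, b: str) -> float:
    """Fraction of identical residues between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def knn_predict_pfm(query: str, known: Dict[str, PFM], k: int = 3) -> PFM:
    """Predict a PFM as the unweighted mean of the PFMs of the k proteins
    most similar to the query (assumed similarity: sequence identity)."""
    ranked = sorted(known, key=lambda seq: identity(query, seq), reverse=True)
    neighbors = ranked[:k]
    width = len(known[neighbors[0]])
    return [
        [sum(known[s][pos][base] for s in neighbors) / len(neighbors)
         for base in range(4)]
        for pos in range(width)
    ]


# Toy data (hypothetical, not real homeodomain sequences or measured motifs).
known_pfms = {
    "KRAR": [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    "KRAK": [[0.8, 0.2, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    "QQQQ": [[0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
}
predicted = knn_predict_pfm("KRAA", known_pfms, k=2)
```

With `k=2`, the query is averaged over its two closest toy neighbors and the dissimilar third entry is ignored. A practical version would likely weight neighbors by similarity and restrict the comparison to DNA-contacting residues, but those choices are outside what the abstract specifies.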
Pages: I84-I89
Number of pages: 6