Sequence2Vec: a novel embedding approach for modeling transcription factor binding affinity landscape

Cited by: 33
Authors
Dai, Hanjun [1]
Umarov, Ramzan [2]
Kuwahara, Hiroyuki [2]
Li, Yu [2]
Song, Le [1]
Gao, Xin [2]
Affiliations
[1] Georgia Inst Technol, Coll Comp, Atlanta, GA 30332 USA
[2] KAUST, CBRC, Comp Elect & Math Sci & Engn CEMSE Div, Thuwal 23955-6900, Saudi Arabia
Keywords
PARAMETER-ESTIMATION; MASTER REGULATOR; GENE-EXPRESSION; DNA; PROTEIN; SITES; SPECIFICITIES; GCN4; MICROARRAYS; STARVATION;
DOI
10.1093/bioinformatics/btx480
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
An accurate characterization of the transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have created the opportunity to build binding affinity prediction methods, accurately characterizing the TF-DNA binding affinity landscape remains a challenging problem. Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model that captures both position-specific information and long-range dependency in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these hidden Markov models into a common nonlinear feature space and uses the embedded features to build a predictive model. Our method combines the strengths of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA datasets that were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as state-of-the-art binding affinity prediction methods.
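The abstract describes the embedding step only at a high level. As a rough illustration of what a message passing-like embedding over a chain-structured sequence model could look like, here is a minimal NumPy sketch. It is not the authors' implementation: the update rule, the ReLU nonlinearity, the sum pooling, and the names `one_hot`, `embed_sequence`, `W_x`, `W_m` and `n_rounds` are all assumptions made for illustration, and the weights here are random rather than learned.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot matrix."""
    idx = [ALPHABET.index(c) for c in seq]
    x = np.zeros((len(seq), len(ALPHABET)))
    x[np.arange(len(seq)), idx] = 1.0
    return x


def embed_sequence(seq, W_x, W_m, n_rounds=3):
    """Hypothetical chain embedding (illustrative, not the published algorithm).

    Each position i keeps a d-dimensional vector mu_i. In every round,
    mu_i is recomputed from its own nucleotide and from the current
    vectors of its chain neighbours i-1 and i+1, so information spreads
    one step further along the sequence per round.
    """
    local = one_hot(seq) @ W_x.T           # (L, d): position-specific term
    mu = np.zeros_like(local)
    for _ in range(n_rounds):
        neigh = np.zeros_like(mu)
        neigh[1:] += mu[:-1]               # contribution from the left neighbour
        neigh[:-1] += mu[1:]               # contribution from the right neighbour
        mu = np.maximum(0.0, local + neigh @ W_m.T)  # ReLU update
    return mu.sum(axis=0)                  # pool positions into one vector


# Assumed toy parameters; in practice these would be trained end to end.
rng = np.random.default_rng(0)
d = 8
W_x = rng.normal(size=(d, 4))
W_m = 0.1 * rng.normal(size=(d, d))

features = embed_sequence("ACGTACGT", W_x, W_m)
```

Because the per-position vectors are pooled by summation, sequences of different lengths map to vectors of the same dimension `d`, which is what lets a single downstream regressor be trained on the embedded features.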
Pages: 3575-3583
Number of pages: 9