Sequence2Vec: a novel embedding approach for modeling transcription factor binding affinity landscape

Cited by: 33
Authors
Dai, Hanjun [1]
Umarov, Ramzan [2]
Kuwahara, Hiroyuki [2]
Li, Yu [2]
Song, Le [1]
Gao, Xin [2]
Affiliations
[1] Georgia Inst Technol, Coll Comp, Atlanta, GA 30332 USA
[2] KAUST, CBRC, Comp Elect & Math Sci & Engn CEMSE Div, Thuwal 23955-6900, Saudi Arabia
Keywords
PARAMETER-ESTIMATION; MASTER REGULATOR; GENE-EXPRESSION; DNA; PROTEIN; SITES; SPECIFICITIES; GCN4; MICROARRAYS; STARVATION;
DOI
10.1093/bioinformatics/btx480
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
An accurate characterization of transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, the accurate characterization of TF-DNA binding affinity landscape still remains a challenging problem. Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model which captures both position specific information and long-range dependency in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these hidden Markov models into a common nonlinear feature space and uses these embedded features to build a predictive model. Our method is a novel combination of the strength of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA datasets which were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as the state-of-the-art binding affinity prediction methods.
Pages: 3575-3583
Number of pages: 9