CAE-CNN: Predicting transcription factor binding site with convolutional autoencoder and convolutional neural network

Cited by: 16
Authors
Zhang, Yongqing [1, 2]
Qiao, Shaojie [3]
Zeng, Yuanqi [1]
Gao, Dongrui [1]
Han, Nan [4]
Zhou, Jiliu [1]
Affiliations
[1] Chengdu Univ Informat Technol, Sch Comp Sci, Chengdu 610225, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[3] Chengdu Univ Informat Technol, Sch Software Engn, Chengdu 610225, Peoples R China
[4] Chengdu Univ Informat Technol, Sch Management, Chengdu 610103, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Transcription factor binding sites; Convolutional neural networks; Motif discovery; Bioinformatics; Autoencoder; CHIP-SEQ; DNA; IDENTIFICATION;
DOI
10.1016/j.eswa.2021.115404
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A transcription factor binding site (TFBS) is a DNA sequence that binds a transcription factor and regulates transcription of the gene. Although deep learning algorithms outperform traditional methods in predicting transcription factor binding sites, they often rely too heavily on negative samples that cannot be verified experimentally. In particular, training a model with such negative samples can introduce considerable noise and degrade classification performance. To address these drawbacks, we propose a new architecture, CAE-CNN (Convolutional AutoEncoder and Convolutional Neural Network), that combines a convolutional autoencoder with a convolutional neural network. Specifically, motivated by image reconstruction, we use a convolutional autoencoder to extract useful features from the positive DNA sequence samples. The learned features are then used by the convolutional neural network during training. Furthermore, we employ a highway connection layer to better capture the features of DNA nucleotides through a gated unit. Extensive experiments on human and mouse TFBS datasets evaluate the effectiveness of the proposed method on the motif discovery task, where it outperforms state-of-the-art methods in accuracy, precision, recall, and AUC. To the best of our knowledge, the original contribution of this work lies in integrating unsupervised and supervised learning methods to study TFBSs, thereby building a more robust and generative TFBS prediction model.
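The abstract names two building blocks without giving their details: DNA sequences as network input and a highway connection layer with a gated unit. The sketch below illustrates both in plain Python under stated assumptions. It is not the authors' implementation: the one-hot encoding over A, C, G, T is a common convention assumed here, and the highway unit follows the standard formulation y = T(x) * H(x) + (1 - T(x)) * x. The names `one_hot`, `highway`, `w_h`, `b_h`, `w_t`, and `b_t` are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"  # assumed channel order; the abstract does not specify one


def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T).

    Bases outside ACGT (for example N) become all-zero rows.
    """
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES]
            for base in seq.upper()]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway(x, w_h, b_h, w_t, b_t):
    """Standard highway unit: y = T(x) * H(x) + (1 - T(x)) * x.

    H is a ReLU-activated affine transform and T is a sigmoid gate.
    Both weight matrices must be square so that H(x), T(x), and x
    have the same length.
    """
    def affine(w, b):
        return [sum(wij * xj for wij, xj in zip(row, x)) + bi
                for row, bi in zip(w, b)]

    h = [max(0.0, v) for v in affine(w_h, b_h)]   # candidate transform H(x)
    t = [sigmoid(v) for v in affine(w_t, b_t)]    # transform gate T(x)
    # Elementwise mix: the gate decides how much of H(x) replaces x.
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]
```

When the gate bias is strongly negative, T(x) is close to 0 and the unit passes its input through almost unchanged; when it is strongly positive, the output is close to H(x). This gating behavior is the general property of highway layers, and how CAE-CNN configures it is not stated in the abstract.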
Pages: 11