CAE-CNN: Predicting transcription factor binding site with convolutional autoencoder and convolutional neural network

Cited by: 16
Authors
Zhang, Yongqing [1, 2]
Qiao, Shaojie [3]
Zeng, Yuanqi [1]
Gao, Dongrui [1]
Han, Nan [4]
Zhou, Jiliu [1]
Affiliations
[1] Chengdu Univ Informat Technol, Sch Comp Sci, Chengdu 610225, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[3] Chengdu Univ Informat Technol, Sch Software Engn, Chengdu 610225, Peoples R China
[4] Chengdu Univ Informat Technol, Sch Management, Chengdu 610103, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Transcription factor binding sites; Convolutional neural networks; Motif discovery; Bioinformatics; Autoencoder; CHIP-SEQ; DNA; IDENTIFICATION
DOI
10.1016/j.eswa.2021.115404
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A transcription factor binding site (TFBS) is a DNA sequence that binds a transcription factor and regulates the transcription of a gene. Although deep learning algorithms outperform traditional methods in predicting TFBSs, they often rely heavily on negative samples that cannot be verified experimentally. In particular, training a model with such negative samples can introduce considerable noise and degrade classification performance. To address these drawbacks, we propose a new architecture that combines a convolutional autoencoder with a convolutional neural network, called CAE-CNN (Convolutional AutoEncoder and Convolutional Neural Network). Specifically, motivated by image reconstruction, we use a convolutional autoencoder to extract useful features from the positive DNA nucleotide samples. The learned features are then used by the convolutional neural network during training. Furthermore, we employ a highway connection layer to better capture the features of DNA nucleotides through a gated unit. Extensive experiments on human and mouse TFBS datasets demonstrate the effectiveness of the proposed method for the motif discovery task; it outperforms state-of-the-art methods in accuracy, precision, recall, and AUC. To the best of our knowledge, the original contribution of this work lies in integrating unsupervised and supervised learning methods to study TFBSs, thereby building a more robust and generative TFBS prediction model.
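The abstract does not give layer sizes, activations, or the input encoding, so the following is only a minimal sketch of two ingredients it mentions: DNA nucleotides as model input and a highway layer with a gated unit. It assumes a one-hot encoding over A, C, G, T and the generic highway formulation y = T(x) * H(x) + (1 - T(x)) * x with a ReLU transform and a sigmoid gate. The function names (`one_hot`, `dense`, `highway`) and all weights are hypothetical and are not taken from CAE-CNN.

```python
import math


def one_hot(seq):
    """Encode a DNA string as rows over (A, C, G, T).

    Assumption: unknown bases such as N become a uniform 0.25 row.
    """
    table = {
        "A": [1.0, 0.0, 0.0, 0.0],
        "C": [0.0, 1.0, 0.0, 0.0],
        "G": [0.0, 0.0, 1.0, 0.0],
        "T": [0.0, 0.0, 0.0, 1.0],
    }
    return [list(table.get(base, [0.25] * 4)) for base in seq.upper()]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway(x, w_h, b_h, w_t, b_t):
    """Generic highway layer: y = T(x) * H(x) + (1 - T(x)) * x.

    H is a ReLU transform and T is a sigmoid gate; both choices are
    assumptions for illustration, not the paper's configuration.
    Input and output must have the same length.
    """
    h = [max(0.0, v) for v in dense(x, w_h, b_h)]   # candidate features
    t = [sigmoid(v) for v in dense(x, w_t, b_t)]    # gate values in (0, 1)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]
```

With a strongly negative gate bias the layer passes its input through almost unchanged, and with a strongly positive one it outputs the transformed features. This is the behavior that lets a gated unit decide how much of each feature to carry forward.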
Pages: 11