Identifying transcription factor-DNA interactions using machine learning

Cited by: 3
Authors
Bang, Sohyun [1]
Galli, Mary [2]
Crisp, Peter A. [3]
Gallavotti, Andrea [2]
Schmitz, Robert J. [4]
Affiliations
[1] Univ Georgia, Inst Bioinformat, Athens, GA 30602 USA
[2] Rutgers State Univ, Waksman Inst Microbiol, Piscataway, NJ 08834 USA
[3] Univ Queensland, Sch Agr & Food Sci, St Lucia, Qld 4072, Australia
[4] Univ Georgia, Dept Genet, Athens, GA 30602 USA
Source
IN SILICO PLANTS | 2022 / Vol. 4 / Issue 02
Funding
National Science Foundation (USA);
Keywords
Auxin response factors; DAP-seq; epigenomics; machine learning; TF-DNA interaction; FACTOR-BINDING; GENE-EXPRESSION; IMBALANCED DATA; REGULATORY DNA; GENOME; IDENTIFICATION; SPECIFICITY; INITIATION; ENHANCERS; DOMAINS;
DOI
10.1093/insilicoplants/diac014
Chinese Library Classification number
S3 [Agronomy];
Discipline classification code
0901;
Abstract
Machine learning approaches have been applied to identify transcription factor (TF)-DNA interactions important for gene regulation and expression. However, because of the enormous search space of the genome, it is challenging to build models capable of surveying entire reference genomes, especially in species on which the models were not trained. In this study, we surveyed a variety of methods for classifying epigenomics data in an attempt to improve the detection of DNA bound by 12 members of the auxin response factor (ARF) family in maize and soybean, as assessed by DNA Affinity Purification and sequencing (DAP-seq). We reduced the genome search space for prediction by surveying only unmethylated regions (UMRs). For identifying DAP-seq-binding events within the UMRs, we achieved an average accuracy of 78.72% across the 12 maize ARFs by encoding DNA with k-mer count vectorization and using a logistic regression classifier with up-sampling and feature selection. Importantly, feature selection helps to uncover known and potentially novel ARF-binding motifs, which demonstrates an independent method for identifying TF-binding sites. Finally, we applied the model built with maize DAP-seq data directly to the soybean genome and found high false-negative rates, exceeding 40% across the ARF TFs tested. These findings suggest that various methods can be used to predict TF-DNA interactions within and between species, with varying degrees of success.
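
The encoding and balancing steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy sequences, the labels, and the choice of k = 2 are all assumptions made for illustration, and the logistic regression classifier and feature selection steps are omitted.

```python
from collections import Counter
from itertools import product
import random


def kmer_vocabulary(k):
    """Return all 4**k DNA k-mers in lexicographic order (A, C, G, T)."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def count_vectorize(seq, k, vocab):
    """Count overlapping k-mers in seq and return counts in vocab order.

    K-mers containing characters outside ACGT (e.g. N) are not in the
    vocabulary and are therefore ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get(kmer, 0) for kmer in vocab]


def upsample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until both classes match in size.

    Assumes binary labels (0 or 1) and at least one example per class.
    """
    rng = random.Random(seed)
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes need at least one example")
    if len(pos) < len(neg):
        minority, minority_label, n_extra = pos, 1, len(neg) - len(pos)
    else:
        minority, minority_label, n_extra = neg, 0, len(pos) - len(neg)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return rows + extra, labels + [minority_label] * n_extra


# Toy data (hypothetical): 1 = bound region, 0 = unbound region.
vocab = kmer_vocabulary(2)
sequences = ["TGTCGG", "ACACAC", "GGGTTT"]
labels = [1, 0, 0]
features = [count_vectorize(s, 2, vocab) for s in sequences]
balanced_rows, balanced_labels = upsample(features, labels)
```

Each row of `balanced_rows` is a fixed-length count vector, which is the kind of input a logistic regression classifier would take. In a real analysis, up-sampling would normally be applied to the training split only, so that duplicated examples do not leak into the evaluation data.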
Pages: 15