A Comparative Study of Supervised Machine Learning Algorithms for the Prediction of Long-Range Chromatin Interactions

Cited by: 11
Authors
Vanhaeren, Thomas [1]
Divina, Federico [1]
Garcia-Torres, Miguel [1]
Gomez-Vela, Francisco [1]
Vanhoof, Wim [2]
Manuel Martinez-Garcia, Pedro [3, 4]
Affiliations
[1] Univ Pablo de Olavide, Div Comp Sci, Seville 41013, Spain
[2] Univ Namur, Fac Comp Sci, B-5000 Namur, Belgium
[3] Univ Pablo de Olavide, Ctr Andaluz Biol Mol & Med Regenerat CABIMER, CSIC, Univ Sevilla, Seville 41092, Spain
[4] Univ Isabel I, Fac Ciencias & Tecnol, Burgos 09003, Spain
Keywords
machine-learning; chromatin interactions; prediction; genomics; genome architecture; MAMMALIAN GENOMES; DOMAINS; ORGANIZATION; COHESIN; TRANSCRIPTION; ARCHITECTURE; PRINCIPLES; CTCF;
DOI
10.3390/genes11090985
Chinese Library Classification (CLC) number
Q3 [Genetics]
Discipline classification code
071007; 090102
Abstract
The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from long-range chromatin interaction maps produced by Chromatin Conformation Capture-based techniques, which have improved greatly in recent years. Because these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy for generating virtual maps in cell types and conditions for which experimental chromatin interaction data are not available. Several methods rely on predictive models trained on one-dimensional (1D) sequencing features and have yielded promising results. However, these approaches differ both in how they model chromatin interactions and in the machine learning strategy they rely on, which makes it difficult to compare the performance of existing methods. In this study, we use publicly available 1D sequencing signals to model cohesin-mediated chromatin interactions in two human cell lines and evaluate the prediction performance of six popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines, multi-layer perceptrons and deep learning. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other five methods, yielding accuracies of about 95%. We show that chromatin features in close genomic proximity to the anchors carry most of the predictive information, as has been previously reported. Moreover, we demonstrate that gradient boosting models, unlike the other methods tested, are able to produce accurate predictions when trained with different subsets of chromatin features. In this regard, transcription factors, in addition to architectural proteins, are shown to be highly informative. Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best-suited algorithm for this task and highlights cell-type-specific binding of transcription factors at the anchors as an important determinant of chromatin wiring mediated by cohesin.
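As a rough illustration of how 1D signals near the anchors might be turned into model inputs, the sketch below builds a feature vector for one candidate anchor pair. It is not the authors' pipeline: the function names (`window_signal`, `pair_features`), the peak format, the `flank` parameter, the toy track names and coordinates, and the appended distance feature are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]      # (chromosome, start, end)
Peak = Tuple[str, int, int, float]   # interval plus a signal score

def window_signal(anchor: Interval, peaks: List[Peak], flank: int = 0) -> float:
    """Sum the scores of peaks overlapping the anchor extended by `flank` bp."""
    chrom, start, end = anchor
    lo, hi = max(0, start - flank), end + flank
    return sum(
        (score for c, s, e, score in peaks if c == chrom and s < hi and e > lo),
        0.0,
    )

def pair_features(left: Interval, right: Interval,
                  tracks: Dict[str, List[Peak]], flank: int = 0) -> List[float]:
    """One feature per (track, anchor), plus the gap between the two anchors."""
    features: List[float] = []
    for name in sorted(tracks):  # fixed track order keeps columns aligned across pairs
        features.append(window_signal(left, tracks[name], flank))
        features.append(window_signal(right, tracks[name], flank))
    features.append(float(right[1] - left[2]))  # distance in bp (hypothetical extra feature)
    return features

# Toy, made-up peaks for two hypothetical tracks.
tracks = {
    "CTCF": [("chr1", 100, 200, 5.0), ("chr1", 5000, 5100, 3.0)],
    "RAD21": [("chr1", 150, 250, 2.0)],
}
left_anchor = ("chr1", 120, 220)
right_anchor = ("chr1", 5050, 5150)
vector = pair_features(left_anchor, right_anchor, tracks)
print(vector)
```

Vectors of this kind, one per candidate pair with an interaction label, could then be passed to any of the six classifier families named in the abstract. Restricting the window to the anchors and a small flank reflects the observation that features close to the anchors carry most of the predictive information.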
Pages: 1-17
Page count: 17