LOCO-EPI: Leave-one-chromosome-out (LOCO) as a benchmarking paradigm for deep learning based prediction of enhancer-promoter interactions

Cited by: 1
Authors
Tahir, Muhammad [1]
Khan, Shehroz S. [2]
Davie, James [3]
Yamanaka, Soichiro [4]
Ashraf, Ahmed [1]
Affiliations
[1] Univ Manitoba, Dept Elect & Comp Engn, Winnipeg, MB R3T 5V6, Canada
[2] American Univ Middle East, Coll Engn & Technol, Kuwait, Kuwait
[3] Univ Manitoba, Dept Biochem & Med Genet, Max Rady Coll Med, Rady Fac Hlth Sci, Winnipeg, MB, Canada
[4] Univ Tokyo, Grad Sch Sci, Dept Biophys & Biochem, Tokyo, Japan
Funding
Canadian Institutes of Health Research;
Keywords
Enhancer-promoter interactions; Hybrid features; DNA sequences; Deep neural networks; SEQUENCE-BASED PREDICTOR; GENOME; SITES;
DOI
10.1007/s10489-024-05848-6
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In mammalian and other vertebrate genomes, the promoter regions of genes and their distal enhancers may be located millions of base pairs apart, and a promoter may not interact with its closest enhancer. Since base-pair proximity is not a good indicator of these interactions, a significant body of work aims to develop methods for understanding Enhancer-Promoter Interactions (EPI) from genetic and epigenomic marks. Over the last decade, several machine learning and deep learning methods have reported increasingly higher accuracies for predicting EPI. Typically, these approaches randomly split the dataset of Enhancer-Promoter (EP) pairs into training and testing subsets and then train the model. However, random splitting inadvertently causes information leakage by assigning EP pairs from the same genomic region to both the training and testing sets. As a result, it has been pointed out in the literature that the performance of EPI prediction algorithms is overestimated because of genomic region overlap between the training and testing parts of the data. Building on that, in this paper we propose a more thorough training and testing paradigm for EPI prediction, namely Leave-one-chromosome-out (LOCO) cross-validation. LOCO has been used in other bioinformatics contexts and ensures that there is no genomic overlap between training and testing sets, enabling a fairer estimation of performance. We demonstrate that a deep learning algorithm that achieves high accuracy when trained and tested under random splitting drops drastically in performance under the LOCO setting, indicating that performance has been overestimated in the previous literature. We also propose a novel hybrid multi-branch neural network architecture for EPI prediction. In particular, our architecture has one branch consisting of a deep neural network, while the other branch extracts traditional k-mer features derived from the nucleotide sequence. The two branches are later merged and the neural network is trained jointly, forcing the network to learn feature representations that are not already covered by the k-mer features. We show that the hybrid architecture performs significantly better in a realistic and fair LOCO testing paradigm, demonstrating that it can learn more general aspects of EP interactions instead of overfitting to genomic regions. Through this paper we are also releasing the LOCO splitting-based EPI dataset to encourage other research groups to benchmark their EPI algorithms using a consistent LOCO paradigm. Research data is available in this public repository: https://github.com/malikmtahir/EPI
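The two ideas named in the abstract, LOCO splitting and k-mer features, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout (a `chrom` key on each EP pair), the function names `loco_splits` and `kmer_frequencies`, and the toy data are all assumptions made for illustration, and the released dataset may be organized differently.

```python
from collections import defaultdict
from itertools import product


def loco_splits(pairs):
    """Yield (held_out_chrom, train, test) for leave-one-chromosome-out CV.

    `pairs` is a list of dicts; the "chrom" key is a hypothetical field
    naming the chromosome on which the enhancer-promoter pair lies.
    """
    by_chrom = defaultdict(list)
    for pair in pairs:
        by_chrom[pair["chrom"]].append(pair)
    for held_out in sorted(by_chrom):
        # Test on one chromosome, train on all the others, so no genomic
        # region can appear on both sides of the split.
        test = by_chrom[held_out]
        train = [p for c, group in by_chrom.items() if c != held_out for p in group]
        yield held_out, train, test


def kmer_frequencies(sequence, k=3):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    The output vector has 4**k entries in lexicographic k-mer order.
    """
    kmers = ["".join(t) for t in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


# Toy records, invented purely for illustration.
TOY_PAIRS = [
    {"chrom": "chr1", "enhancer": "ACGTAC", "promoter": "TTGACA", "label": 1},
    {"chrom": "chr1", "enhancer": "GGCATT", "promoter": "TATAAT", "label": 0},
    {"chrom": "chr2", "enhancer": "CCGGAA", "promoter": "TTGACA", "label": 1},
    {"chrom": "chrX", "enhancer": "ATATCG", "promoter": "GCGCGC", "label": 0},
]

if __name__ == "__main__":
    for chrom, train, test in loco_splits(TOY_PAIRS):
        print(chrom, "train:", len(train), "test:", len(test))
```

In a hybrid model of the kind the abstract describes, a vector such as the one returned by `kmer_frequencies` would feed the k-mer branch while the raw sequence feeds the deep branch; how the authors merge the two branches is not specified in this record.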
Pages: 16