Prediction of Super-enhancers Based on Mean-shift Undersampling

Cited by: 1

Authors:

Cheng, Han [1]

Ding, Shumei [1]

Jia, Cangzhi [1]

Affiliations:

[1] Dalian Maritime Univ, Sch Sci, Dalian 116026, Peoples R China

Source:

CURRENT BIOINFORMATICS | 2024 / Vol. 19 / Issue 07

Funding:

National Natural Science Foundation of China;

Keywords:

Super-enhancers; sequence information; XGBoost; mean-shift; clustering; under-sampling; NETWORK;

DOI:

10.2174/0115748936268302231110111456

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Subject classification code:

071010 ; 081704 ;

Abstract:

Background: Super-enhancers are clusters of enhancers defined based on the binding occupancy of master transcription factors, chromatin regulators, or chromatin marks. It has been reported that super-enhancers are transcriptionally more active and cell-type-specific than regular enhancers. Therefore, it is necessary to identify super-enhancers from regular enhancers. A variety of computational methods have been proposed to identify super-enhancers as auxiliary tools. However, most methods use ChIP-seq data, and the lack of this part of the data will make the predictor unable to execute or fail to achieve satisfactory performance.

Objective: The aim of this study is to propose a stacking computational model based on the fusion of multiple features to identify super-enhancers in both human and mouse species.

Methods: This work adopted mean-shift to cluster majority class samples and selected four sets of balanced datasets for mouse and three sets of balanced datasets for human to train the stacking model. Five types of sequence information are used as input to the XGBoost classifier, and the average value of the probability outputs from each classifier is designed as the final classification result.

Results: The results of 10-fold cross-validation and cross-cell-line validation prove that our method has superior performance compared to other existing methods. The source code and datasets are available at https://github.com/Cheng-Han-max/SE_voting.

Conclusion: The analysis of feature importance indicates that Mismatch accounts for the highest proportion among the top 20 important features.
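The Methods summary in the abstract (mean-shift clustering of the majority class, balanced subsets, averaged classifier probabilities) can be illustrated with a minimal sketch. This is not the authors' implementation, which is in the repository linked in the abstract. It assumes a flat-kernel mean-shift, a proportional per-cluster sampling rule, and toy 2-D points in place of sequence features; the function names `mean_shift_labels`, `undersample_majority`, and `average_probabilities` are hypothetical, and the XGBoost training step is omitted, so the probabilities at the end are placeholder values.

```python
import math
import random


def mean_shift_labels(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean-shift: return one cluster label per input point."""
    modes = []
    for p in points:
        x = list(p)
        for _ in range(max_iter):
            # Neighbours of the current estimate within the bandwidth.
            near = [q for q in points if math.dist(x, q) <= bandwidth]
            if not near:
                break
            new_x = [sum(c) / len(near) for c in zip(*near)]
            moved = math.dist(x, new_x)
            x = new_x
            if moved < tol:
                break
        modes.append(x)
    # Merge converged modes that lie within half a bandwidth of each other.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels


def undersample_majority(majority, n_keep, bandwidth, seed=0):
    """Return at most n_keep majority indices, spread over the clusters.

    Assumption: each cluster contributes in proportion to its size.
    """
    rng = random.Random(seed)
    labels = mean_shift_labels(majority, bandwidth)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    chosen = []
    for members in clusters.values():
        quota = max(1, round(n_keep * len(members) / len(majority)))
        chosen.extend(rng.sample(members, min(quota, len(members))))
    rng.shuffle(chosen)
    return sorted(chosen[:n_keep])


def average_probabilities(prob_lists):
    """Soft voting: average per-sample probabilities across classifiers."""
    return [sum(ps) / len(ps) for ps in zip(*prob_lists)]


# Toy majority class: two well-separated blobs of 30 and 10 points.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(30)]
blob_b = [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(10)]
majority = blob_a + blob_b

# Keep 8 majority samples, e.g. to match 8 minority (super-enhancer) samples.
kept = undersample_majority(majority, n_keep=8, bandwidth=2.0)

# Placeholder probabilities from two hypothetical classifiers, two samples.
final = average_probabilities([[0.2, 0.9], [0.4, 0.7]])
```

Repeating `undersample_majority` with different `seed` values would give several balanced subsets, one per classifier, whose predicted probabilities are then combined with `average_probabilities`. Sampling per cluster rather than uniformly is one way to keep the reduced majority set representative of each region of feature space.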


Pages: 651-662

Page count: 12

