Ense-i6mA: Identification of DNA N6-Methyladenine Sites Using XGB-RFE Feature Selection and Ensemble Machine Learning

Cited by: 0
Authors
Fan, Xueqiang [1]
Lin, Bing [1]
Hu, Jun [2]
Guo, Zhongyi [1]
Affiliations
[1] Hefei Univ Technol, Sch Comp & Informat, Hefei 230009, Peoples R China
[2] Zhejiang Univ Technol, Coll Informat Engn, Hangzhou 310023, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DNA; Feature extraction; Encoding; Accuracy; Bioinformatics; Genomics; Benchmark testing; DNA N-6-methyladenine sites; sequence-based encoding; bioinformatics; feature selection; ensemble learning; N6-METHYLADENINE SITES; METHYLATION; GENOME; PACKAGE; MODES;
DOI
10.1109/TCBB.2024.3421228
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
DNA N6-methyladenine (6mA) is an important epigenetic modification that plays a vital role in various cellular processes. Accurate identification of 6mA sites is fundamental to elucidating the biological functions and mechanisms of this modification. However, experimental methods for detecting 6mA sites are expensive and time-consuming. In this study, we propose a novel computational method, called Ense-i6mA, to predict 6mA sites. First, five encoding schemes (one-hot encoding, gcContent, Z-Curve, K-mer nucleotide frequency, and K-mer nucleotide frequency with gap) are employed to extract DNA sequence features. Second, eXtreme Gradient Boosting coupled with recursive feature elimination (XGB-RFE) is applied to remove noisy features, which helps avoid over-fitting and reduces computing time and complexity. Then, the best feature subset is fed into four base classifiers: Extra Trees, eXtreme Gradient Boosting, Light Gradient Boosting Machine, and Support Vector Machine. Finally, to minimize generalization error, the prediction probabilities of the base classifiers are averaged to produce the final 6mA site predictions. We conduct experiments on two species, Arabidopsis thaliana and Drosophila melanogaster, to compare Ense-i6mA against recent 6mA site prediction methods. The experimental results demonstrate that Ense-i6mA achieves area under the receiver operating characteristic curve values of 0.967 and 0.968, accuracies of 91.4% and 92.0%, and Matthews correlation coefficient values of 0.829 and 0.842 on the two benchmark datasets, respectively, and outperforms several existing state-of-the-art methods.
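The encoding and averaging steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the length-normalised global Z-curve, the choice k = 2, and the reading of "K-mer with gap" as pairs of bases separated by a fixed gap are all assumptions made for illustration. The XGB-RFE selection step and the four base classifiers are left out.

```python
from itertools import product

BASES = "ACGT"

def one_hot(seq):
    """Flattened one-hot encoding: four binary values per nucleotide."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]

def gc_content(seq):
    """Fraction of G and C nucleotides in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def z_curve(seq):
    """Global Z-curve components, normalised by length (assumed variant)."""
    a, c, g, t = (seq.count(b) for b in BASES)
    n = len(seq)
    return [
        ((a + g) - (c + t)) / n,  # purine vs. pyrimidine
        ((a + c) - (g + t)) / n,  # amino vs. keto
        ((a + t) - (c + g)) / n,  # weak vs. strong hydrogen bonding
    ]

def kmer_freq(seq, k=2):
    """Relative frequency of every possible k-mer over sliding windows."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        sub = seq[i:i + k]
        if sub in counts:  # skip windows containing non-ACGT symbols
            counts[sub] += 1
    return [counts[m] / total for m in kmers]

def gapped_pair_freq(seq, gap=1):
    """Relative frequency of base pairs separated by `gap` positions
    (an assumed reading of 'K-mer nucleotide frequency with gap')."""
    pairs = [a + b for a in BASES for b in BASES]
    counts = dict.fromkeys(pairs, 0)
    total = len(seq) - gap - 1
    for i in range(total):
        pair = seq[i] + seq[i + gap + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / total for p in pairs]

def encode(seq):
    """Concatenate the five encodings into one feature vector."""
    return (one_hot(seq) + [gc_content(seq)] + z_curve(seq)
            + kmer_freq(seq) + gapped_pair_freq(seq))

def average_probabilities(per_model_probs):
    """Average per-sample probabilities across base classifiers."""
    n_models = len(per_model_probs)
    return [sum(col) / n_models for col in zip(*per_model_probs)]
```

Here `encode` takes one fixed-length DNA string, and `average_probabilities` takes one list of predicted probabilities per base classifier, all ordered by the same samples. Averaging probabilities rather than hard labels lets a confident classifier outweigh an uncertain one.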
Pages: 1842-1854
Number of pages: 13