Prediction of Essential Proteins Using Genetic Algorithm as a Feature Selection Technique

Cited by: 0
Authors
Inzamam-Ul-Hossain, Md. [1]
Islam, Md. Rafiqul [1, 2]
Affiliations
[1] Khulna Univ, Dept Comp Sci & Engn, Khulna 9208, Bangladesh
[2] Amer Int Univ Bangladesh AIUB, Dept Comp Sci, Dhaka 1229, Bangladesh
Keywords
Proteins; Genetic algorithms; Accuracy; Feature extraction; Biological cells; Encoding; Random forests; Topology; Biological feature; composite features; essential proteins; genetic algorithm; SMOTE-ENN; SMOTE-Tomek; topological feature; DATABASE; CLASSIFICATION; OPTIMIZATION; EXPRESSION
DOI
10.1109/ACCESS.2024.3446992
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Essential proteins play a vital role in antibiotic development, disease diagnosis, and understanding the structure of an organism. They are crucial for cell survival and are associated with various human diseases. Many methods have recently been proposed for identifying essential proteins. These methods improve identification accuracy, but a gap remains between the highest achievable accuracy and the accuracy they reach, and other performance metrics, such as recall, specificity, and F1-score, are still very low. Given the importance of essential proteins and the limited performance of earlier work, this paper proposes an approach for predicting essential proteins with high performance. It uses a genetic algorithm-based feature selection technique to find the optimal subset of features for identifying essential proteins. Several data balancing techniques are compared to obtain the best-balanced dataset. Both topological and biological features are used. The Saccharomyces cerevisiae (S. cerevisiae) dataset is used to evaluate the proposed method, and a second dataset for Escherichia coli (E. coli) is used to validate its performance. One of three classifiers (Random Forest, LightGBM, or XGBoost) is used in the genetic algorithm's fitness function to compute the average of accuracy and F1-score. The proposed method produces the best performance metrics on both datasets with fewer features than the original feature set. The highest accuracy achieved is 94.69% for the S. cerevisiae dataset and 95.11% for the E. coli dataset. Other performance scores, such as recall and F1-score, are also high compared with existing methods. In experimental comparisons, the proposed method outperformed existing methods.
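The feature selection loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid classifier is a toy stand-in for Random Forest, LightGBM, or XGBoost, the synthetic data, the single holdout split, the tournament selection, one-point crossover, and all parameter values are assumptions made for illustration, and the data balancing step (SMOTE-ENN or SMOTE-Tomek) is omitted. Only the fitness definition, the average of accuracy and F1-score, is taken from the abstract.

```python
import random

def accuracy_f1_mean(y_true, y_pred):
    """Average of accuracy and F1-score for binary labels (1 = essential)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return (accuracy + f1) / 2

def nearest_centroid_predict(X_train, y_train, X_test, mask):
    """Toy stand-in classifier (assumption): uses only features where mask is 1."""
    cols = [i for i, bit in enumerate(mask) if bit]
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]

    def dist(x, label):
        return sum((x[c] - m) ** 2 for c, m in zip(cols, centroids[label]))

    return [min((0, 1), key=lambda label: dist(x, label)) for x in X_test]

def fitness(mask, data):
    """Fitness of a feature mask: mean of accuracy and F1 on the holdout split."""
    if not any(mask):
        return 0.0  # an empty feature subset cannot classify anything
    X_train, y_train, X_test, y_test = data
    y_pred = nearest_centroid_predict(X_train, y_train, X_test, mask)
    return accuracy_f1_mean(y_test, y_pred)

def select_features(data, n_features, pop_size=20, generations=15,
                    mutation_rate=0.1, seed=0):
    """Hypothetical GA: tournament selection, one-point crossover, bit-flip
    mutation, and elitism. Requires n_features >= 2."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=lambda m: fitness(m, data))
    for _ in range(generations):
        def tournament():
            a, b = rng.sample(population, 2)
            return a if fitness(a, data) >= fitness(b, data) else b

        children = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, n_features)
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=lambda m: fitness(m, data))
    return best, fitness(best, data)

def make_toy_data(n=200, n_features=8, seed=1):
    """Synthetic data (assumption): only features 0 and 1 carry class signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] += 3 * label
        row[1] += 3 * label
        X.append(row)
        y.append(label)
    half = n // 2
    return X[:half], y[:half], X[half:], y[half:]

if __name__ == "__main__":
    data = make_toy_data()
    mask, score = select_features(data, n_features=8)
    print(mask, round(score, 3))
```

Each chromosome is a bit mask over the feature columns, so the search space is the set of feature subsets and the returned mask identifies the selected features. In practice the fitness would more plausibly be estimated with cross-validation on a balanced training set rather than a single holdout split.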
Pages: 126200-126220
Page count: 21