Computational identification of multiple lysine PTM sites by analyzing the instance hardness and feature importance

Cited by: 9
Authors
Ahmed, Sabit [1]
Rahman, Afrida [1]
Hasan, Md Al Mehedi [1]
Ahmad, Shamim [2]
Shovan, S. M. [1]
Affiliations
[1] Rajshahi Univ Engn & Technol, Comp Sci & Engn, Rajshahi 6204, Bangladesh
[2] Univ Rajshahi, Comp Sci & Engn, Rajshahi 6205, Bangladesh
Funding
UK Research and Innovation (UKRI);
Keywords
PROTEASE CLEAVAGE SITES; FEATURE-SELECTION; PREDICTION; PROTEINS; SUCCINYLATION; SVM;
DOI
10.1038/s41598-021-98458-y
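Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject classification codes
07; 0710; 09;
Abstract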
Identification of post-translational modifications (PTM) is significant in the study of computational proteomics, cell biology, pathogenesis, and drug development due to its role in many bio-molecular mechanisms. Though there are several computational tools to identify individual PTMs, only three predictors have been established to predict multiple PTMs at the same lysine residue. Furthermore, detailed analysis and assessment on dataset balancing and the significance of different feature encoding techniques for a suitable multi-PTM prediction model are still lacking. This study introduces a computational method named 'iMul-kSite' for predicting acetylation, crotonylation, methylation, succinylation, and glutarylation, from an unrecognized peptide sample with one, multiple, or no modifications. After successfully eliminating the redundant data samples from the majority class by analyzing the hardness of the sequence-coupling information, feature representation has been optimized by adopting the combination of ANOVA F-Test and incremental feature selection approach. The proposed predictor predicts multi-label PTM sites with 92.83% accuracy using the top 100 features. It has also achieved a 93.36% aiming rate and 96.23% coverage rate, which are much better than the existing state-of-the-art predictors on the validation test. This performance indicates that 'iMul-kSite' can be used as a supportive tool for further K-PTM study.
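The aiming rate, coverage rate, and accuracy quoted in the abstract are multi-label metrics. The record itself does not give their formulas, so the sketch below uses the definitions commonly applied in multi-label PTM prediction: aiming averages |true ∩ predicted| / |predicted|, coverage averages |true ∩ predicted| / |true|, and accuracy averages |true ∩ predicted| / |true ∪ predicted|. The function names, the toy labels, and the handling of peptides with no modification (an empty label set) are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, Sequence, Set


def _ratio(overlap: int, denominator: int, both_empty: bool) -> float:
    """Safe division for set-based scores.

    Assumption (not from the source): if the denominator set is empty,
    score 1.0 when both label sets are empty (a correctly predicted
    unmodified site) and 0.0 otherwise.
    """
    if denominator:
        return overlap / denominator
    return 1.0 if both_empty else 0.0


def multilabel_scores(
    y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]]
) -> Dict[str, float]:
    """Average aiming, coverage, and accuracy over all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    aiming = coverage = accuracy = 0.0
    for true_set, pred_set in zip(y_true, y_pred):
        overlap = len(true_set & pred_set)
        union = len(true_set | pred_set)
        both_empty = union == 0
        aiming += _ratio(overlap, len(pred_set), both_empty)    # precision-like
        coverage += _ratio(overlap, len(true_set), both_empty)  # recall-like
        accuracy += _ratio(overlap, union, both_empty)          # Jaccard-like

    n = len(y_true)
    return {
        "aiming": aiming / n,
        "coverage": coverage / n,
        "accuracy": accuracy / n,
    }


# Hypothetical toy labels for four lysine sites (not data from the paper).
toy_true = [
    {"acetylation", "succinylation"},
    {"methylation"},
    set(),  # unmodified site
    {"crotonylation"},
]
toy_pred = [
    {"acetylation"},
    {"methylation", "glutarylation"},
    set(),
    {"crotonylation"},
]

scores = multilabel_scores(toy_true, toy_pred)
print(scores)
```

In this toy case the first site is under-predicted and the second over-predicted, which lowers coverage and aiming respectively, while both errors reduce accuracy.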
Pages: 12