IDP-Seq2Seq: identification of intrinsically disordered regions based on sequence to sequence learning

Cited by: 128
Authors
Tang, Yi-Jun [1]
Pang, Yi-He [1]
Liu, Bin [1, 2]
Affiliations
[1] Beijing Inst Technol, Sch Comp Sci & Technol, Beijing 100081, Peoples R China
[2] Beijing Inst Technol, Adv Res Inst Multidisciplinary Sci, Beijing 100081, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China
Keywords
ACCURATE PREDICTION; UNSTRUCTURED REGIONS; PROTEIN DISORDER; HUMAN-DISEASES; GENERATION; LANGUAGE; MODEL;
DOI
10.1093/bioinformatics/btaa667
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Intrinsically disordered regions (IDRs) are widely distributed in proteins and are related to many important biological functions. Accurate prediction of IDRs is critical for protein structure and function analysis. However, existing computational methods construct their predictive models solely in the sequence space and fail to convert the sequence space into a 'semantic space' that reflects the structural characteristics of proteins. Furthermore, although length-dependent predictors have shown promising results, new fusion strategies should be explored to improve their predictive performance and generalization.
Results: In this study, we applied Sequence to Sequence Learning (Seq2Seq), derived from natural language processing (NLP), to map protein sequences to a 'semantic space' that reflects structural patterns, with the help of predicted residue-residue contacts (CCMs) and other sequence-based features. Furthermore, the attention mechanism was used to capture the global associations between all residue pairs in a protein. Three length-dependent predictors were constructed: IDP-Seq2Seq-L for long disordered region prediction, IDP-Seq2Seq-S for short disordered region prediction and IDP-Seq2Seq-G for both long and short disordered region prediction. Finally, these three predictors were fused into one predictor, called IDP-Seq2Seq, to improve discriminative power and generalization. Experimental results on four independent test datasets and the CASP test dataset showed that IDP-Seq2Seq is insensitive to the ratio of long to short disordered regions and outperforms other competing methods.
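The two ideas named in the abstract, attention over all residue pairs and fusion of several per-residue predictors, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the attention form or the fusion rule, so scaled dot-product scoring and plain averaging are swapped in here as assumptions, and the function names `self_attention_weights` and `fuse_scores` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(features):
    """Return an L x L weight matrix for L residues.

    Entry [i][j] says how strongly residue i attends to residue j.
    ASSUMPTION: scaled dot-product scoring; the abstract does not
    state which attention variant the paper uses.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    scores = [
        [sum(a * b for a, b in zip(query, key)) / scale for key in features]
        for query in features
    ]
    return [softmax(row) for row in scores]


def fuse_scores(per_predictor_scores):
    """Average per-residue disorder scores from several predictors.

    ASSUMPTION: unweighted averaging stands in for the paper's
    fusion strategy, which the abstract does not describe.
    """
    count = len(per_predictor_scores)
    return [sum(column) / count for column in zip(*per_predictor_scores)]


if __name__ == "__main__":
    # Three residues with made-up 2-dimensional feature vectors.
    toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in self_attention_weights(toy_features):
        print([round(w, 3) for w in row])

    # Made-up scores standing in for the -L, -S and -G predictors.
    print(fuse_scores([[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]]))
```

Each row of the weight matrix sums to 1, so every residue distributes its attention across the whole sequence, which is one way to read "global associations between all residue pairs".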
Pages: 5177-5186
Number of pages: 10