CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction

Cited by: 54
Authors
Cui, Xuefeng [1]
Lu, Zhiwu [2]
Wang, Sheng [3,4]
Wang, Jim Jing-Yan [1]
Gao, Xin [1]
Affiliations
[1] KAUST, CBRC, CEMSE Div, Thuwal 239556900, Saudi Arabia
[2] Renmin Univ China, Sch Informat, Beijing Key Lab Big Data Management & Anal Method, Beijing 100872, Peoples R China
[3] Toyota Technol Inst Chicago, 6045 Kenwood Ave, Chicago, IL 60637 USA
[4] Univ Chicago, Dept Human Genet, E 58th St, Chicago, IL 60637 USA
Keywords
CONTACT PREDICTION; FOLD RECOGNITION; INFORMATION; SERVER; RETRIEVAL; FIELDS; MODEL;
DOI
10.1093/bioinformatics/btw271
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Motivation: Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information.
Method: We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence-structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration.
Results: We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM-HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods.
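The idea of keeping the sequence space and the structure space as two separate networks can be illustrated with a toy sketch. The abstract does not specify the cross-modal learning procedure, so the code below is not CMsearch's algorithm: it swaps in a generic random walk with restart on each network, followed by a plain average. The function names (`row_normalize`, `propagate`), the four-protein similarity matrices, the restart weight and the averaging step are all assumptions made for illustration only.

```python
import numpy as np


def row_normalize(w):
    """Turn a non-negative similarity matrix into a row-stochastic matrix."""
    w = np.asarray(w, dtype=float)
    sums = w.sum(axis=1, keepdims=True)
    sums[sums == 0.0] = 1.0  # isolated nodes keep an all-zero row
    return w / sums


def propagate(query_scores, network, restart=0.5, iters=100):
    """Random walk with restart: diffuse the query's scores over one network."""
    p = row_normalize(network)
    s0 = np.asarray(query_scores, dtype=float)
    s = s0.copy()
    for _ in range(iters):
        s = restart * s0 + (1.0 - restart) * (p.T @ s)
    return s


# Hypothetical toy data: four database proteins, two separate networks.
# Sequence space: proteins 0 and 1 have similar sequences.
seq_net = np.array([
    [0.0, 0.8, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
# Structure space: proteins 0 and 2 have similar structures.
struct_net = np.array([
    [0.0, 0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

# The query's direct sequence similarity hits only protein 0.
direct_hits = np.array([0.9, 0.0, 0.0, 0.0])

# Explore each network separately, then combine (assumed: simple average).
seq_scores = propagate(direct_hits, seq_net)
struct_scores = propagate(direct_hits, struct_net)
combined = 0.5 * (seq_scores + struct_scores)

print(combined)
```

In this toy setup, protein 2 receives a nonzero combined score although the query has no direct sequence similarity to it, because it is a structural neighbour of the direct hit (protein 0). Protein 3, which is isolated in both networks, stays at zero.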
Pages: 332-340
Page count: 9