FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately

Cited by: 80
Authors
Budowski-Tal, Inbal [1]
Nov, Yuval [2]
Kolodny, Rachel [1]
Affiliations
[1] Univ Haifa, Dept Comp Sci, IL-31905 Haifa, Israel
[2] Univ Haifa, Dept Stat, IL-31905 Haifa, Israel
Keywords
evaluation of structure search; fast structural search of Protein Data Bank; filter and refine; protein backbone fragments; protein structure search; STRUCTURE ALIGNMENT; STRUCTURE PREDICTION; STRUCTURE DATABASE; COMPARING PROTEIN; CLASSIFICATION; SIMILARITY; FOLD; TOOL; SEQUENCES; SPACE
DOI
10.1073/pnas.0914097107
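Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract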
Fast identification of protein structures that are similar to a specified query structure in the entire Protein Data Bank (PDB) is fundamental in structure and function prediction. We present FragBag, an ultrafast and accurate method for comparing protein structures. We describe a protein structure by the collection of its overlapping short contiguous backbone segments, and discretize this set using a library of fragments. Then, we succinctly represent the protein as a "bag of fragments," a vector that counts the number of occurrences of each fragment, and measure the similarity between two structures by the similarity between their vectors. Our representation has two additional benefits: (i) it can be used to construct an inverted index for implementing a fast structural search engine of the entire PDB, and (ii) one can specify a structure as a collection of substructures without combining them into a single structure; this is valuable for structure prediction when there are reliable predictions for only parts of the protein. We use receiver operating characteristic (ROC) curve analysis to quantify the success of FragBag in identifying neighbor candidate sets in a dataset of over 2,900 structures. The gold standard is the set of neighbors found by six state-of-the-art structural aligners. Our best FragBag library finds more accurate candidate sets than the three other filter methods: the SGM, PRIDE, and a method by Zotenko et al. More interestingly, FragBag performs on a par with the computationally expensive, yet highly trusted, structural aligners STRUCTAL and CE.
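The representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state how a backbone segment is matched to a library fragment or which vector similarity is used, so the sketch assumes a distance-RMSD over internal pairwise distances (chosen because it needs no superposition) and cosine similarity. The two-fragment library, the toy coordinates, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def internal_distances(points):
    """All pairwise distances; invariant to rotation and translation."""
    n = len(points)
    return [math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)]


def nearest_fragment(segment, library):
    """Index of the closest library fragment.

    Assumption: closeness is distance-RMSD between internal distance
    lists, a stand-in for whatever criterion the paper actually uses.
    """
    seg = internal_distances(segment)

    def drmsd(fragment):
        ref = internal_distances(fragment)
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(seg, ref)) / len(seg))

    return min(range(len(library)), key=lambda k: drmsd(library[k]))


def bag_of_fragments(backbone, library):
    """Count, per library fragment, the overlapping windows it best matches."""
    k = len(library[0])  # all fragments are assumed to have the same length
    counts = Counter(nearest_fragment(backbone[i:i + k], library)
                     for i in range(len(backbone) - k + 1))
    return [counts.get(j, 0) for j in range(len(library))]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_inverted_index(bags):
    """Map each fragment index to the structures whose bag contains it."""
    index = defaultdict(set)
    for name, bag in bags.items():
        for j, count in enumerate(bag):
            if count:
                index[j].add(name)
    return index


# Hypothetical toy data: a straight and a bent three-point fragment.
library = [
    [(0, 0, 0), (1, 0, 0), (2, 0, 0)],  # fragment 0: straight
    [(0, 0, 0), (1, 0, 0), (1, 1, 0)],  # fragment 1: right-angle bend
]
structures = {
    "line": [(0, 0, 0), (0, 1, 0), (0, 2, 0), (0, 3, 0)],
    "zigzag": [(0, 0, 0), (1, 0, 0), (1, 1, 0), (2, 1, 0)],
    "mixed": [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0)],
}
bags = {name: bag_of_fragments(coords, library)
        for name, coords in structures.items()}
index = build_inverted_index(bags)

# Candidate neighbors of a query: structures sharing at least one fragment.
query = bags["line"]
candidates = set().union(*(index[j] for j, c in enumerate(query) if c))
ranked = sorted(candidates, key=lambda s: cosine_similarity(query, bags[s]),
                reverse=True)
```

The inverted index restricts scoring to structures that share at least one fragment with the query, which is the kind of candidate filtering the abstract describes. Note that distance-RMSD cannot distinguish a fragment from its mirror image, so a superposition-based measure would behave differently on chiral fragments.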
Pages: 3481-3486
Number of pages: 6