Text Mining for Protein Docking

Cited by: 18
Authors
Badal, Varsha D. [1]
Kundrotas, Petras J. [1]
Vakser, Ilya A. [1,2]
Affiliations
[1] Univ Kansas, Ctr Computat Biol, Lawrence, KS 66045 USA
[2] Univ Kansas, Dept Mol Biosci, Lawrence, KS 66045 USA
Funding
US National Science Foundation;
Keywords
SUPPORT VECTOR MACHINE; BIOMEDICAL LITERATURE; INTERACTION INFORMATION; MOLECULAR-BIOLOGY; PREDICTION; FEATURES; EXTRACTION; RETRIEVAL; ALGORITHM; ARTICLES;
DOI
10.1371/journal.pcbi.1004630
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
The rapidly growing amount of publicly available information from biomedical research is readily accessible on the Internet, providing a powerful resource for predictive biomolecular modeling. The accumulated data on experimentally determined structures transformed structure prediction of proteins and protein complexes. Instead of exploring the enormous search space, predictive tools can simply proceed to the solution based on similarity to existing, previously determined structures. A similar major paradigm shift is emerging due to the rapidly expanding amount of information, other than experimentally determined structures, that can still be used as constraints in biomolecular structure prediction. Automated text mining has been widely used in recreating protein interaction networks, as well as in detecting small-ligand binding sites on protein structures. Combining and expanding these two well-developed areas of research, we applied text mining to structural modeling of protein-protein complexes (protein docking). Protein docking can be significantly improved when constraints on the docking mode are available. We developed a procedure that retrieves published abstracts on a specific protein-protein interaction and extracts information relevant to docking. The procedure was assessed on protein complexes from DOCKGROUND (http://dockground.compbio.ku.edu). The results show that correct information on binding residues can be extracted for about half of the complexes. The amount of irrelevant information was reduced by conceptual analysis of a subset of the retrieved abstracts, based on the bag-of-words (features) approach. Support Vector Machine models were trained and validated on the subset. The remaining abstracts were filtered by the best-performing models, which decreased the irrelevant information for approximately 25% of the complexes in the dataset. The extracted constraints were incorporated in the docking protocol and tested on the DOCKGROUND unbound benchmark set, significantly increasing the docking success rate.
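As a rough illustration of the two text-processing ideas named in the abstract (extracting residue mentions from abstracts, and representing an abstract as a bag of words), the following Python sketch may help. It is not the authors' implementation: the regular expression, the function names `extract_residues` and `bag_of_words`, and the tokenization rule are all assumptions made for this example, and the Support Vector Machine training step is omitted.

```python
import re
from collections import Counter

# Three-letter codes of the 20 standard amino acids.
AA3 = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}

# Hypothetical pattern: matches mentions such as "Arg123", "Arg-123", "Arg 123".
RESIDUE_RE = re.compile(r"\b(" + "|".join(sorted(AA3)) + r")[- ]?(\d{1,4})\b")


def extract_residues(text):
    """Return unique (residue, position) pairs found in text, sorted by position."""
    found = {(m.group(1), int(m.group(2))) for m in RESIDUE_RE.finditer(text)}
    return sorted(found, key=lambda pair: (pair[1], pair[0]))


def bag_of_words(text):
    """Return word counts for text (lowercased, alphabetic tokens only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


if __name__ == "__main__":
    abstract = ("Mutation of Arg123 and Tyr-45 abolished binding; "
                "Arg 123 contacts the partner protein.")
    print(extract_residues(abstract))
    print(bag_of_words(abstract).most_common(3))
```

A pattern this simple will produce false positives (for example, the word "His" followed by a number), which is the kind of irrelevant information a downstream classifier over bag-of-words features would be meant to reduce.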
Pages: 21