Text mining for modeling of protein complexes enhanced by machine learning

Cited by: 4
Authors
Badal, Varsha D. [1]
Kundrotas, Petras J. [1]
Vakser, Ilya A. [1,2]
Affiliations
[1] Univ Kansas, Computat Biol Program, Lawrence, KS 66045 USA
[2] Univ Kansas, Dept Mol Biosci, Lawrence, KS 66045 USA
Keywords
EXTRACTION; DOCKING;
DOI
10.1093/bioinformatics/btaa823
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Motivation: Procedures for structural modeling of protein-protein complexes (protein docking) produce a number of models that need to be further analyzed and scored. Scoring can be based on independently determined constraints on the structure of the complex, such as knowledge of amino acids essential for the protein interaction. Previously, we showed that text mining of residues in freely available PubMed abstracts of papers on protein-protein interactions can generate such constraints. However, the absence of post-processing of the spotted residues reduced the usability of the constraints, because a significant number of the residues were not relevant to the binding of the specific proteins.
Results: We explored filtering of the irrelevant residues by two machine learning approaches, Deep Recursive Neural Network (DRNN) and Support Vector Machine (SVM) models, with different training/testing schemes. The results showed that the DRNN model is superior to the SVM model when training is performed on the PMC-OA full-text articles and the model is then applied to classification (interface or non-interface) of the residues spotted in the PubMed abstracts. When both training and testing are performed on full-text articles, or both on abstracts, the performance of the two models is similar. Thus, in such cases, there is no need to use the DRNN approach, which is computationally expensive, especially at the training stage. The DRNN advantage in the mixed setting arises because SVM success is often determined by the similarity of data/text patterns in the training and testing sets, whereas the sentence structures in abstracts are, in general, different from those in full-text articles.
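As a rough illustration of what "spotting" residues in abstract text might look like, the sketch below matches three-letter amino acid codes followed by a position number. It is not the authors' pipeline: the function name `spot_residues`, the regular expression, and the example sentence are all assumptions made for illustration. It only finds candidate mentions and does no interface/non-interface filtering, which is the step the abstract assigns to the SVM and DRNN models.

```python
import re

# Three-letter codes of the 20 standard amino acids.
AMINO_ACIDS = (
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
)

# Hypothetical pattern: a residue code, an optional hyphen or space,
# then a sequence position, e.g. "Arg123", "Lys-45", "Gly 7".
RESIDUE_PATTERN = re.compile(r"\b(" + "|".join(AMINO_ACIDS) + r")[- ]?(\d+)\b")


def spot_residues(text):
    """Return (residue_code, position) pairs in order of appearance."""
    return [(m.group(1), int(m.group(2))) for m in RESIDUE_PATTERN.finditer(text)]


sentence = ("Mutation of Arg123 and Lys-45 abolished binding, "
            "whereas Gly 7 had no effect.")
print(spot_residues(sentence))
```

In this made-up sentence, a naive spotter returns Gly 7 alongside the two residues described as affecting binding, which is the kind of irrelevant hit that a later classification step would need to filter out.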
Pages: 497-505
Number of pages: 9