Text mining for modeling of protein complexes enhanced by machine learning

Cited by: 4

Authors:

Badal, Varsha D. [1]

Kundrotas, Petras J. [1]

Vakser, Ilya A. [1,2]

Affiliations:

[1] Univ Kansas, Computat Biol Program, Lawrence, KS 66045 USA

[2] Univ Kansas, Dept Mol Biosci, Lawrence, KS 66045 USA

Source:

BIOINFORMATICS | 2021 / Vol. 37 / Issue 04

Keywords:

EXTRACTION; DOCKING;

DOI:

10.1093/bioinformatics/btaa823

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Motivation: Procedures for structural modeling of protein-protein complexes (protein docking) produce a number of models that need to be further analyzed and scored. Scoring can be based on independently determined constraints on the structure of the complex, such as knowledge of the amino acids essential for the protein interaction. Previously, we showed that text mining of residues in freely available PubMed abstracts of papers on protein-protein interactions can generate such constraints. However, the absence of post-processing of the spotted residues reduced the usability of the constraints, because a significant number of the residues were not relevant to the binding of the specific proteins.

Results: We explored filtering of the irrelevant residues by two machine learning approaches, Deep Recursive Neural Network (DRNN) and Support Vector Machine (SVM) models, with different training/testing schemes. The results showed that the DRNN model is superior to the SVM model when training is performed on the PMC-OA full-text articles and the model is applied to classification (interface or non-interface) of the residues spotted in the PubMed abstracts. When both training and testing are performed on full-text articles, or both on abstracts, the performance of the two models is similar. Thus, in such cases, there is no need to use the DRNN approach, which is computationally expensive, especially at the training stage. The reason is that SVM success is often determined by the similarity of data/text patterns in the training and testing sets, whereas the sentence structures in abstracts are, in general, different from those in full-text articles.


Pages: 497-505

Page count: 9

References: 66 in total

