Mining protein phosphorylation information from biomedical literature using NLP parsing and Support Vector Machines

Cited by: 1
Authors
Raja, Kalpana [1 ,2 ]
Natarajan, Jeyakumar [1 ]
Affiliations
[1] Bharathiar Univ, Sch Life Sci, Dept Bioinformat, Data Min & Text Min Lab, Coimbatore 641046, Tamil Nadu, India
[2] Univ Michigan, Sch Med, Dept Dermatol, Ann Arbor, MI USA
Keywords
Human protein phosphorylation; hPP corpus; Support Vector Machines; Natural language processing; Information extraction; Post transcriptional modification; EXTRACTION; DATABASE; SYSTEM;
DOI
10.1016/j.cmpb.2018.03.022
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Subject classification code
081203 ; 0835 ;
Abstract
Background: Extraction of protein phosphorylation information from biomedical literature has gained much attention because of the importance of phosphorylation in numerous biological processes. Objective: In this study, we propose a text mining methodology consisting of two phases, NLP parsing and SVM classification, to extract phosphorylation information from the literature. Methods: First, using NLP parsing, we divide the data into three base-forms depending on the biomedical entities related to phosphorylation, and further classify them into ten sub-forms based on their distribution with respect to the phosphorylation keyword. Next, we extract the phosphorylation entity singles/pairs/triplets and apply SVM to classify them using a set of features applicable to each sub-form. Results: The performance of our methodology was evaluated on three corpora, namely PLC, iProLink and the hPP corpus. We obtained promising results of >85% F-score on the ten sub-forms of the training datasets in cross-validation. Our system achieved an overall F-score of 93.0% on the iProLink and 96.3% on the hPP corpus test datasets. Furthermore, our proposed system achieved the best performance in cross-corpus evaluation and outperformed the existing system with a recall of 90.1%. Conclusions: The performance analysis of our system on three corpora shows that it extracts protein phosphorylation information efficiently from both non-organism-specific general datasets such as PLC and iProLink and a human-specific dataset such as the hPP corpus. (C) 2018 Elsevier B.V. All rights reserved.
Pages: 57-64
Number of pages: 8
Related papers
50 items in total
  • [21] Support Vector Machines for Protein Family Identification using Surface Invariant Coordinates
    Satpute, Babasaheb
    Yadav, Raghav
    2018 3RD INTERNATIONAL CONFERENCE FOR CONVERGENCE IN TECHNOLOGY (I2CT), 2018,
  • [22] Supervised change detection in VHR images using contextual information and support vector machines
    Volpi, Michele
    Tuia, Devis
    Bovolo, Francesca
    Kanevski, Mikhail
    Bruzzone, Lorenzo
    INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2013, 20 : 77 - 85
  • [23] Working set selection using second order information for training support vector machines
    Fan, RE
    Chen, PH
    Lin, CJ
    JOURNAL OF MACHINE LEARNING RESEARCH, 2005, 6 : 1889 - 1918
  • [24] Substructure prediction from infrared spectra by using support vector machines
    Liu, JH
    Lu, MC
    Nie, FS
    Feng, XY
    Li, ML
    CHINESE CHEMICAL LETTERS, 2005, 16 (10) : 1354 - 1356
  • [25] Speaker Recognition from Coded Speech Using Support Vector Machines
    Janicki, Artur
    Staroszczyk, Tomasz
    TEXT, SPEECH AND DIALOGUE, TSD 2011, 2011, 6836 : 291 - 298
  • [26] Substructure Prediction from Infrared Spectra by Using Support Vector Machines
    Jun Hong LIU
    Chinese Chemical Letters, 2005, (10) : 1354 - 1356
  • [27] Finding Conserved Regions in Protein Structures Using Support Vector Machines and Structure Alignment
    Akutsu, Tatsuya
    Hayashida, Morihiro
    Tamura, Takeyuki
    PATTERN RECOGNITION IN BIOINFORMATICS, 2012, 7632 : 233 - 242
  • [28] A two-stage classifier for protein β-turn prediction using support vector machines
    Chiu, Hua-Sheng
    Lin, Hsin-Nan
    Lo, Allan
    Sung, Ting-Yi
    Hsu, Wen-Lian
    2006 IEEE INTERNATIONAL CONFERENCE ON GRANULAR COMPUTING, 2006, : 738 - +
  • [29] Multi-class protein subcellular localization classification using support vector machines
    Meng, PW
    Rajapakse, JC
    PROCEEDINGS OF THE 2005 IEEE SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, 2005, : 526 - 533
  • [30] PROTEIN SECONDARY STRUCTURE PREDICTION USING SUPPORT VECTOR MACHINES AND A NEW FEATURE REPRESENTATION
    Gubbi, Jayavardhana
    Lai, Daniel T. H.
    Palaniswami, Marimuthu
    Parker, Michael
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS, 2006, 6 (04) : 551 - 567