Protein Name Recognition Based on Dictionary Mining and Heuristics

Cited by: 0
Authors
Lin, Shian-Hua [1]
Ding, Shao-Hong [1]
Zeng, Wei-Sheng [1]
Affiliation
[1] Natl Chi Nan Univ, Dept Comp Sci & Informat Engn, Puli 545, Nantou Hsien, Taiwan
Source
ALGORITHMIC ASPECTS IN INFORMATION AND MANAGEMENT, AAIM 2014 | 2014, Vol. 8546
Keywords
protein name recognition; association mining; dictionary mining; heuristics; GENE; TEXT; IDENTIFICATION; PATTERNS
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
We propose a novel method that integrates dictionary-based, heuristic, and data mining approaches to recognize exact protein names in the literature efficiently and effectively. Using a protein name dictionary and the heuristic rules published in related studies, the core tokens of protein names can be detected efficiently. The exact boundaries of protein names, however, are hard to identify. By treating the tokens of a protein name as items within a transaction, we apply association mining to discover significant sequential patterns (SSPs) in the protein name dictionary. Using the SSPs, each candidate name is extended from its core tokens toward the left and right boundaries so that the full protein name is recognized correctly. On the Yapex101 corpus, our Protein Name Recognition System (PNRS) achieves an F-score of 74.49%, outperforming existing systems and previously published results.
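
The boundary-extension idea in the abstract can be illustrated with a minimal Python sketch. This is not the PNRS implementation: the abstract does not specify how SSPs are mined or scored, so the sketch assumes the simplest possible stand-in, namely ordered pairs of adjacent tokens that occur in at least `min_support` dictionary names. The function names, the toy dictionary, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter


def mine_adjacent_patterns(dictionary_names, min_support=2):
    """Return ordered adjacent token pairs found in >= min_support names.

    Each dictionary name is treated as one transaction, so a pair is
    counted at most once per name. (Assumption: adjacent pairs stand in
    for the paper's significant sequential patterns.)
    """
    counts = Counter()
    for name in dictionary_names:
        tokens = name.lower().split()
        counts.update(set(zip(tokens, tokens[1:])))
    return {pair for pair, n in counts.items() if n >= min_support}


def extend_from_core(tokens, core_index, patterns):
    """Grow a name left and right from a core token while the adjacent
    token pair is a mined pattern; return the resulting token span."""
    lowered = [t.lower() for t in tokens]
    left = right = core_index
    while left > 0 and (lowered[left - 1], lowered[left]) in patterns:
        left -= 1
    while right < len(lowered) - 1 and (lowered[right], lowered[right + 1]) in patterns:
        right += 1
    return tokens[left:right + 1]


# Toy dictionary (hypothetical entries, for illustration only).
toy_dictionary = [
    "protein kinase C",
    "protein kinase A",
    "tyrosine kinase receptor",
    "receptor tyrosine kinase",
]
patterns = mine_adjacent_patterns(toy_dictionary, min_support=2)

sentence = "Activation of protein kinase was observed".split()
# "kinase" (index 3) plays the role of a core token found by dictionary lookup.
print(extend_from_core(sentence, 3, patterns))  # ['protein', 'kinase']
```

In this sketch the extension stops as soon as the neighbouring pair is not a frequent pattern, which is why "of" and "was" are excluded from the recognized span. A real system would need longer patterns and a significance measure rather than a raw support count.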
Pages: 75-87
Page count: 13