Improving the performance of dictionary-based approaches in protein name recognition

Cited by: 60
Authors
Tsuruoka, Y
Tsujii, J
Affiliations
[1] JST Agcy, CREST, Kawaguchi, Saitama 3320012, Japan
[2] Univ Tokyo, Dept Comp Sci, Bunkyo Ku, Tokyo 1130033, Japan
Keywords
protein name recognition; naive Bayes classifier; approximate string search; spelling variant generator
DOI
10.1016/j.jbi.2004.08.003
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Dictionary-based protein name recognition is often a first step in extracting information from biomedical documents because it can provide ID information on recognized terms. However, dictionary-based approaches present two fundamental difficulties: (1) false recognition, mainly caused by short names; and (2) low recall due to spelling variations. In this paper, we tackle the former problem by using machine learning to filter out false positives, and we present two alternative methods for alleviating the latter problem of spelling variations. The first uses approximate string searching, and the second expands the dictionary with a probabilistic variant generator, which we propose in this paper. Experimental results on the GENIA corpus show that filtering with a naive Bayes classifier greatly improved precision with only a slight loss of recall, resulting in a 10.8% improvement in F-measure; dictionary expansion with the variant generator gave a further 1.6% improvement and achieved an F-measure of 66.6%. (C) 2004 Elsevier Inc. All rights reserved.
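As an illustration only, the idea of recovering spelling variants through approximate string search, and of scoring the result with an F-measure, can be sketched as follows. This is not the authors' implementation: the abstract does not specify the matching algorithm or its cost function, so the sketch assumes plain unit-cost edit distance, and the dictionary entries, the IDs, and the function names (edit_distance, approximate_lookup, f_measure) are hypothetical.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (an assumption, not the paper's cost model)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def approximate_lookup(candidate: str,
                       dictionary: Dict[str, str],
                       max_distance: int = 1) -> List[Tuple[str, str, int]]:
    """Return (name, id, distance) for entries within max_distance of candidate."""
    key = candidate.lower()
    hits = []
    for name, entry_id in dictionary.items():
        distance = edit_distance(key, name.lower())
        if distance <= max_distance:
            hits.append((name, entry_id, distance))
    return sorted(hits, key=lambda hit: hit[2])


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy dictionary; the IDs are placeholders, not real database IDs.
TOY_DICTIONARY = {
    "NF-kappa B": "ID_001",
    "interleukin-2": "ID_002",
}

if __name__ == "__main__":
    # "NF kappa B" differs from the dictionary entry by one character.
    print(approximate_lookup("NF kappa B", TOY_DICTIONARY))
```

A linear scan over the dictionary is used here for clarity; a real protein dictionary would likely need an index or a bounded search to stay fast. Raising max_distance recovers more variants but also admits more false matches, especially for short names, which is the kind of error the filtering step described in the abstract is meant to reduce.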
Pages: 461 - 470
Page count: 10
Related papers
4 items in total
  • [1] Protein Name Recognition Based on Dictionary Mining and Heuristics
    Lin, Shian-Hua
    Ding, Shao-Hong
    Zeng, Wei-Sheng
    ALGORITHMIC ASPECTS IN INFORMATION AND MANAGEMENT, AAIM 2014, 2014, 8546 : 75 - 87
  • [2] Flat and Nested Protein Name Recognition Based on BioBERT and Biaffine Decoder
    Tang, Zhan
    Kou, Xupeng
    Xue, Hongcheng
    Xia, Yuantian
    BIOINFORMATICS RESEARCH AND APPLICATIONS, PT I, ISBRA 2024, 2024, 14954 : 25 - 38
  • [3] Use of morphological analysis in protein name recognition
    Yamamoto, K
    Kudo, T
    Konagaya, A
    Matsumoto, Y
    JOURNAL OF BIOMEDICAL INFORMATICS, 2004, 37 (06) : 471 - 482
  • [4] Enhancing performance of protein and gene name recognizers with filtering and integration strategies
    Hou, WJ
    Chen, HH
    JOURNAL OF BIOMEDICAL INFORMATICS, 2004, 37 (06) : 448 - 460