Extracting a sparsely-located named entity from online HTML medical articles using support vector machine

Cited by: 1
Authors
Zou, Jie [1]
Le, Daniel [1]
Thoma, George R. [1]
Affiliations
[1] Natl Lib Med, Lister Hill Natl Ctr Biomed Commun, Bethesda, MD 20894 USA
Source
DOCUMENT RECOGNITION AND RETRIEVAL XV | 2008 / Vol. 6815
Keywords
information extraction; support vector machine; databank accession number; named entity recognition
DOI
10.1117/12.765907
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We describe a statistical machine learning method for extracting databank accession numbers (DANs) from online medical journal articles. Because DANs are sparsely located in the articles, we take a hierarchical approach. The HTML journal articles are first segmented into zones according to text and geometric features. The zones are then classified as DAN zones or other zones by an SVM classifier. A set of heuristic rules is then applied to the candidate DAN zones to extract DANs according to their edit distances to the DAN formats. An evaluation shows that the proposed method achieves a very high recall rate (above 99%) and a significantly better precision rate than extraction through brute-force regular expression matching.
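The last stage described in the abstract, matching candidates against DAN formats by edit distance, can be illustrated with a minimal sketch. It is not the authors' implementation: this record does not give the paper's heuristic rules or format list, so the `shape` abstraction, the two templates in `FORMATS`, and the `max_distance` threshold are all assumptions made for illustration. The sketch also assumes the text passed in has already been identified as a DAN zone by the earlier stages.

```python
def shape(token: str) -> str:
    """Abstract a token: letters become 'A', digits become '9', others are kept."""
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in token)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical format templates, written in the same shape alphabet.
# They are placeholders, not the DAN formats used in the paper.
FORMATS = {
    "two_letters_six_digits": "AA999999",
    "one_letter_five_digits": "A99999",
}


def match_candidates(zone_text: str, max_distance: int = 1):
    """Return (token, format_name, distance) for tokens close to some template."""
    results = []
    for raw in zone_text.split():
        token = raw.strip(".,;:()[]")
        if not token:
            continue
        token_shape = shape(token)
        name, dist = min(
            ((n, edit_distance(token_shape, f)) for n, f in FORMATS.items()),
            key=lambda pair: pair[1],
        )
        if dist <= max_distance:
            results.append((token, name, dist))
    return results


if __name__ == "__main__":
    print(match_candidates("GenBank accession number AF123456."))
```

Allowing a small nonzero distance lets a near-miss token still be reported for review, whereas an exact pattern match would reject it outright. Restricting the search to classified DAN zones is what keeps such tolerant matching from flooding the output with false positives.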
Pages: 10