A Case Study on Rule-based and CRF-based Author Extraction Methods

Cited by: 0
Authors
Yang, Shengwen
Xiong, Yuhong
Source
IMAGING AND PRINTING IN A WEB 2.0 WORLD; AND MULTIMEDIA CONTENT ACCESS: ALGORITHMS AND SYSTEMS IV | 2010 / Vol. 7540
Keywords
Author Extraction; Heuristic Method; Formal Method; Conditional Random Field
DOI
10.1117/12.838781
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Information extraction (IE) is the task of automatically extracting structured information from unstructured documents. A typical application of IE is to process a set of documents written in a natural language and populate a database with the extracted information. This paper presents a case study on author extraction from unstructured documents. A rule-based method and a Conditional Random Field (CRF)-based method are implemented for this task. The rule-based method involves defining a set of heuristic rules and leveraging prior knowledge of author names and affiliations to identify metadata. The CRF-based method involves preparing a labeled training dataset, defining a set of feature functions, learning a CRF model, and applying the model to label new documents. We evaluate and compare the performance of the two methods through experiments, and give some hints for application developers on choosing between heuristics and formal methods when addressing real-world information extraction problems.
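The two approaches named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lexicons, the regular expression, the function names, and the feature set are all hypothetical and chosen only for illustration, and the sample affiliation line is made up. The first part shows a heuristic rule that combines capitalization with prior knowledge of names and affiliation cue words. The second part shows the kind of per-token feature function that could be handed to a CRF toolkit; training and decoding are omitted.

```python
import re

# Hypothetical prior knowledge: tiny lexicons for illustration only.
GIVEN_NAMES = {"shengwen", "yuhong"}
AFFILIATION_CUES = {"university", "institute", "laboratory", "department", "college"}

# A name-like token: a capitalized word ("Yang") or an initial ("J.").
NAME_TOKEN = re.compile(r"^[A-Z][a-z]+\.?$|^[A-Z]\.$")


def tokenize(line):
    """Split on commas and whitespace, dropping the connective 'and'."""
    return [t for t in re.split(r"[,\s]+", line.strip()) if t and t.lower() != "and"]


def looks_like_author_line(line):
    """Heuristic rule: a short line of name-like tokens that contains a
    known given name and no affiliation cue word."""
    tokens = tokenize(line)
    if not 2 <= len(tokens) <= 12:
        return False
    lowered = {t.lower().strip(".") for t in tokens}
    if lowered & AFFILIATION_CUES:
        return False
    if not all(NAME_TOKEN.match(t) for t in tokens):
        return False
    return bool(lowered & GIVEN_NAMES)


def extract_authors(lines):
    """Return the names on the first line that passes the heuristic rule."""
    for line in lines:
        if looks_like_author_line(line):
            parts = re.split(r",|\band\b", line)
            return [p.strip() for p in parts if p.strip()]
    return []


def token_features(tokens, i):
    """Assumed per-token features of the kind a CRF toolkit could consume."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "in_name_lexicon": tok.lower() in GIVEN_NAMES,
        "is_affiliation_cue": tok.lower() in AFFILIATION_CUES,
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


# Made-up document header used as sample input.
header = [
    "A Case Study on Rule-based and CRF-based Author Extraction Methods",
    "Shengwen Yang and Yuhong Xiong",
    "Example Research Institute",
]
authors = extract_authors(header)
features = token_features(tokenize(header[1]), 0)
```

The rule-based part needs no training data but depends entirely on the quality of the hand-written rules and lexicons, whereas the feature function only describes each token and leaves the decision to a model learned from labeled examples.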
Pages: 10