Machine learning for information extraction in informal domains

Cited by: 121
Authors
Freitag, D [1]
Affiliation
[1] Justsyst Pittsburgh Res Ctr, Pittsburgh, PA 15213 USA
Keywords
information extraction; multistrategy learning
DOI
10.1023/A:1007601113994
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We consider the problem of learning to perform information extraction in domains where linguistic processing is problematic, such as Usenet posts, email, and finger plan files. In place of syntactic and semantic information, other sources of information can be used, such as term frequency, typography, formatting, and mark-up. We describe four learning approaches to this problem, each drawn from a different paradigm: a rote learner, a term-space learner based on Naive Bayes, an approach using grammatical induction, and a relational rule learner. Experiments on 14 information extraction problems defined over four diverse document collections demonstrate the effectiveness of these approaches. Finally, we describe a multistrategy approach which combines these learners and yields performance competitive with or better than the best of them. This technique is modular and flexible, and could find application in other machine learning problems.
Pages: 169-202
Page count: 34
Related papers
49 in total
[1]  
[Anonymous], P 13 INT JOINT C ART
[2]  
[Anonymous], [No title captured]
[3]  
[Anonymous], P 2 INT C INF KNOWL
[4]  
Aone C., 1996, Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing, P302
[5]   AUTOMATED LEARNING OF DECISION RULES FOR TEXT CATEGORIZATION [J].
APTE, C ;
DAMERAU, F ;
WEISS, SM .
ACM TRANSACTIONS ON INFORMATION SYSTEMS, 1994, 12 (03) :233-251
[6]  
AUGUST SE, 1992, P 4 MESS UND C MUC 4, P189
[7]  
Bikel D.M., 1997, Proceedings of the fifth conference on Applied natural language processing. Association for Computational Linguistics, P194
[8]  
CALIFF ME, 1998, THESIS U TEXAS AUSTI
[9]  
CARDIE C, 1993, PROCEEDINGS OF THE ELEVENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, P798
[10]  
Cardie C, 1997, AI MAG, V18, P65