Constructing Reference Sets from Unstructured, Ungrammatical Text

Cited by: 2
Authors
Michelson, Matthew [1]
Knoblock, Craig A. [2]
Affiliations
[1] Fetch Technol, El Segundo, CA 90245 USA
[2] Univ So Calif, Inst Informat Sci, Marina Del Rey, CA 90292 USA
Funding
National Science Foundation (USA)
DOI
10.1613/jair.2937
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Vast amounts of text on the Web are unstructured and ungrammatical, such as classified ads, auction listings, and forum postings. We call such text "posts." Despite their inconsistent structure and lack of grammar, posts are full of useful information. This paper presents work on semi-automatically building tables of relational information, called "reference sets," by analyzing such posts directly. Reference sets can be applied to a number of tasks, such as ontology maintenance and information extraction. Our reference-set construction method starts with just a small amount of background knowledge and constructs tuples representing the entities in the posts to form a reference set. We also describe an extension to this approach for the special case where even this small amount of background knowledge is impossible to discover and use. To evaluate the utility of the machine-constructed reference sets, we compare them to manually constructed reference sets in the context of reference-set-based information extraction. Our results show that the reference sets constructed by our method outperform manually constructed reference sets. We also compare the reference-set-based extraction approach using the machine-constructed reference set to supervised extraction approaches using generic features. These results demonstrate that using machine-constructed reference sets outperforms the supervised methods, even though the supervised methods require training data.
Pages: 189-221
Page count: 33