Conceptual-model-based data extraction from multiple-record Web pages

Cited by: 147
Authors
Embley, DW [1]
Campbell, DM [1]
Jiang, YS [1]
Liddle, SW [1]
Lonsdale, DW [1]
Ng, YK [1]
Smith, RD [1]
Affiliations
[1] Brigham Young University, Data Extraction Group, Provo, UT 84602, USA
Keywords
data extraction; data structuring; unstructured data; data-rich document; World-Wide Web; ontology; ontological conceptual modeling; obituaries
DOI
10.1016/S0169-023X(99)00027-0
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Electronically available data on the Web is exploding at an ever-increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document's content. For these kinds of data-rich, multiple-record documents (e.g., advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others) we can apply a conceptual-modeling approach to extract and structure data automatically. The approach is based on an ontology (a conceptual model instance) that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines to recognize and extract data from unstructured documents and structure it according to the generated database scheme. Experiments show that it is possible to achieve good recall and precision ratios for documents that are rich in recognizable constants and narrow in ontological breadth. Our approach is less labor-intensive than other approaches that manually or semiautomatically generate wrappers, and it is generally insensitive to changes in Web-page format. © 1999 Elsevier Science B.V. All rights reserved.
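The pipeline the abstract describes (ontology in, database scheme and recognizers out, then extraction into that scheme) can be illustrated with a toy sketch. This is not the authors' implementation: the `ONTOLOGY` dictionary layout, the field names, the regular expressions, the 20-character keyword window, and the blank-line record separator are all assumptions made for illustration only.

```python
import re

# Hypothetical toy ontology: each field lists a lexical pattern for its
# constants and optional context keywords expected just before the value.
ONTOLOGY = {
    "name": "Obituary",
    "fields": {
        "deceased_name": {"pattern": r"[A-Z][a-z]+ [A-Z][a-z]+", "keywords": []},
        "death_date": {
            "pattern": r"[A-Z][a-z]+ \d{1,2}, \d{4}",
            "keywords": ["died", "passed away"],
        },
        "age": {"pattern": r"\d{1,3}", "keywords": ["age"]},
    },
}


def build_schema(ontology):
    """Derive a one-table database scheme from the ontology's field names."""
    cols = ", ".join(f"{field} TEXT" for field in ontology["fields"])
    return f"CREATE TABLE {ontology['name']} ({cols})"


def extract_record(ontology, text):
    """Apply each field's recognizer to one record; unmatched fields stay None."""
    row = {}
    for field, spec in ontology["fields"].items():
        value = None
        if spec["keywords"]:
            # Accept a constant only if it follows a context keyword closely
            # (within 20 characters; an arbitrary window chosen for this sketch).
            for kw in spec["keywords"]:
                m = re.search(re.escape(kw) + r".{0,20}?(" + spec["pattern"] + ")", text)
                if m:
                    value = m.group(1)
                    break
        else:
            # No keywords: take the first constant matching the pattern.
            m = re.search(spec["pattern"], text)
            if m:
                value = m.group(0)
        row[field] = value
    return row


def extract_all(ontology, page):
    """Split a multiple-record page on blank lines (an assumed separator)."""
    records = [r.strip() for r in re.split(r"\n\s*\n", page) if r.strip()]
    return [extract_record(ontology, r) for r in records]


PAGE = """Jane Doe died on March 3, 1998, at age 87.

John Smith passed away April 12, 1999."""

SCHEMA = build_schema(ONTOLOGY)
ROWS = extract_all(ONTOLOGY, PAGE)
```

Because the recognizers key on constants and nearby keywords rather than on page markup, a sketch like this is unaffected by layout changes, which is the property the abstract claims for the approach. A real system would need a far richer ontology and a proper record-boundary detector.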
Pages: 227-251
Page count: 25