Bootstrapping semantic annotation for content-rich HTML documents

Cited by: 0
Authors
Mukherjee, S [1]
Ramakrishnan, IV [1]
Singh, A [1]
Affiliations
[1] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA
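Source
ICDE 2005: 21ST INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS | 2005
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract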
An enormous amount of semantic data is still encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable to Semantic Web processing. In this paper we describe a highly automated technique for annotating HTML documents, especially template-based, content-rich documents that contain many different semantic concepts per document. Starting with a small seed of hand-labeled instances of semantic concepts in a set of HTML documents, we bootstrap an annotation process that automatically identifies unlabeled concept instances present in other documents. The bootstrapping technique exploits the observation that semantically related items in content-rich documents exhibit consistency in presentation style and spatial locality, and uses it to learn a statistical model for accurately identifying different semantic concepts in HTML documents drawn from a variety of Web sources. We also present experimental results on the effectiveness of the technique.
Pages: 583-593
Page count: 11