A case-based recognition of semantic structures in HTML documents - An automated transformation from HTML to XML

Cited by: 0
Authors
Umehara, M [1]
Iwanuma, K [1]
Nabeshima, H [1]
Affiliations
[1] Yamanashi Univ, Kofu, Yamanashi 4008511, Japan
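Source
INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2002 | 2002 / Vol. 2412
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract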
Recognizing and extracting semantic/logical structures in HTML documents is an important and difficult task for intelligent document processing. In this paper, we show that alignment is well suited to recognizing the characteristic semantic/logical structures of a series of HTML documents within a framework of case-based reasoning. That is, given a series of HTML documents and a sample transformation of one HTML document into an XML format, alignment can identify the semantic/logical structures in the remaining HTML documents of the series by matching the text-block sequence of each remaining document against that of the sample transformation. Alignment naturally exploits several important properties of text, such as continuity and sequentiality. It can significantly improve the case-based transformation method, which transforms a spatial/temporal series of HTML documents into machine-readable XML formats. Through experimental evaluations, we show that the case-based method with alignment achieves highly accurate transformation of HTML documents into XML.
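The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the scoring function or the alignment procedure, so the token-overlap `similarity`, the `GAP` and `MISMATCH` penalties, the `record` root element, and the sample data below are all assumptions made for illustration. The sketch aligns the text-block sequence of a new document with that of one hand-labeled sample by standard dynamic programming, then copies the sample's XML tags onto the aligned blocks.

```python
import xml.etree.ElementTree as ET

GAP = -0.2       # assumed penalty for leaving a block unaligned
MISMATCH = -1.0  # assumed penalty for pairing blocks that share no token


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercased tokens; MISMATCH when nothing is shared."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    shared = len(ta & tb)
    return shared / len(ta | tb) if shared else MISMATCH


def align(sample: list[str], target: list[str]) -> list[tuple[int, int]]:
    """Order-preserving global alignment; returns (sample_idx, target_idx) pairs."""
    n, m = len(sample), len(target)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (score[i - 1][j - 1] + similarity(sample[i - 1], target[j - 1]), "pair"),
                (score[i - 1][j] + GAP, "skip_sample"),
                (score[i][j - 1] + GAP, "skip_target"),
            ]
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])
    # Walk back from the last cell, collecting the paired blocks.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if move[i][j] == "pair":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "skip_sample":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def label_blocks(sample_case, target_blocks):
    """Copy each sample tag onto the target block it aligns with (None if unaligned)."""
    tags = [None] * len(target_blocks)
    for i, j in align([text for text, _ in sample_case], target_blocks):
        tags[j] = sample_case[i][1]
    return list(zip(tags, target_blocks))


def to_xml(labeled, root_tag="record"):
    """Serialize labeled blocks as XML; unlabeled blocks are left out."""
    root = ET.Element(root_tag)
    for tag, text in labeled:
        if tag is not None:
            ET.SubElement(root, tag).text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample case: text blocks of one document with hand-assigned tags.
SAMPLE_CASE = [
    ("Seminar: Logic Programming", "title"),
    ("Date: 12 May 2002", "date"),
    ("Speaker: A. Smith", "speaker"),
]
# Hypothetical text blocks of another document in the same series.
NEW_BLOCKS = [
    "Seminar: Data Mining",
    "Admission is free",
    "Date: 3 June 2002",
    "Speaker: B. Jones",
]
```

Calling `to_xml(label_blocks(SAMPLE_CASE, NEW_BLOCKS))` serializes the new document using the sample's tags. Because the alignment preserves block order and penalizes gaps, it reflects the sequentiality and continuity properties the abstract mentions, although a realistic system would presumably score blocks on richer features than shared tokens.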
Pages: 141-147
Page count: 7