A heuristic approach for converting HTML documents to XML documents

Cited by: 0
Authors
Lim, SJ [1]
Ng, YK [1]
Affiliations
[1] Brigham Young Univ, Dept Comp Sci, Provo, UT 84602 USA
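Source
COMPUTATIONAL LOGIC - CL 2000 | 2000 / Vol. 1861
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes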
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
XML is rapidly emerging, and yet there still exist numerous HTML documents on the Web. In this paper, we present a heuristic approach for converting HTML documents to XML documents. During the conversion process, we eliminate all the HTML elements in an HTML document from the resulting XML document, since these elements are designed exclusively for the display of data, but retain the character data of each element along with the implicit hierarchy among the data. The proposed conversion approach extracts the data hierarchy of HTML documents as closely as possible with no human intervention. The approach can be adopted to construct the data hierarchy of an HTML document and to collect data in HTML documents into an XML repository.
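The abstract does not spell out the heuristics themselves. As a rough illustration only of what "eliminating the HTML elements while retaining the character data and its implicit hierarchy" could look like, the sketch below assumes a single, much simpler rule that is not taken from the paper: heading levels (`h1` to `h6`) define the nesting, and all other tags are discarded. The names `HeadingHierarchyParser`, `html_to_xml`, `BLOCK_TAGS`, and the output element names `document`, `section`, and `data` are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: heading tags are the only hierarchy cue used in this sketch.
HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}
# Assumption: these tags mark boundaries between separate pieces of text.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "table", "tr", "td", "th", "br"}


class HeadingHierarchyParser(HTMLParser):
    """Drops HTML tags, keeps character data, nests it by heading level."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        # Stack of (heading level, XML node); the root sits at level 0.
        self.stack = [(0, self.root)]
        self.in_heading = None  # level of the heading being read, if any
        self.buffer = []        # character data collected since last flush

    def _take_text(self):
        # Join buffered data and collapse runs of whitespace.
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        return text

    def _flush_text(self):
        # Attach pending text to the innermost open section.
        text = self._take_text()
        if text:
            ET.SubElement(self.stack[-1][1], "data").text = text

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS:
            self._flush_text()
            self.in_heading = HEADING_LEVELS[tag]
        elif tag in BLOCK_TAGS and self.in_heading is None:
            self._flush_text()

    def handle_endtag(self, tag):
        if tag in HEADING_LEVELS and self.in_heading is not None:
            level, self.in_heading = self.in_heading, None
            title = self._take_text()
            # Close sections at the same or a deeper level.
            while self.stack[-1][0] >= level:
                self.stack.pop()
            section = ET.SubElement(self.stack[-1][1], "section", {"title": title})
            self.stack.append((level, section))
        elif tag in BLOCK_TAGS and self.in_heading is None:
            self._flush_text()

    def handle_data(self, data):
        self.buffer.append(data)

    def close(self):
        super().close()
        self._flush_text()


def html_to_xml(html: str) -> str:
    """Return an XML string holding only the text and its heading hierarchy."""
    parser = HeadingHierarchyParser()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<h1>Fruits</h1><p>Apples are <b>red</b>.</p>"
        "<h2>Citrus</h2><p>Lemons</p>"
        "<h1>Vegetables</h1><p>Carrots</p>"
    )
    print(html_to_xml(sample))
```

This sketch ignores many cases a real converter would have to handle, such as tables, lists, documents without headings, and the contents of `script` or `style` elements, so it should be read as a toy instance of the general idea rather than as the authors' method.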
Pages: 1182-1196
Page count: 15
Related papers
10 in total
[1] AROCENA G, 1998, P 14 ICDE
[2] ATEZNI P, 1997, P 16 INT S PRINC DAT, P144
[3] Bray Tim, 1998, Extensible markup language
[4] WebView: A tool for retrieving internal structures and extracting information from HTML documents [J]. Lim, SJ; Ng, YK. 6TH INTERNATIONAL CONFERENCE ON DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PROCEEDINGS, 1999: 71-80
[5] LIM SJ, 2000, HEURISTIC APPROACH C
[6] MENDELZON A, 1996, P C PAR DISTR INF SY
[7] Mendelzon A. O., 1997, Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS 1997, P134, DOI 10.1145/263661.263677
[8] RAGGETT D, HTML 3 2 REF SPECIFI
[9] Raggett D., 1998, HTML 4 0 SPECIFICATI
[10] SAHUGUET A, 1999, P 4 INT C COOP INF S