XML-Based Web Data Pattern Discovery and Extraction

Cited by: 0
Authors
Jia, Rui [1]
Xu, Shicheng [1]
Peng, Chengbao [1]
Affiliation
[1] Neusoft Corp, Shenyang, Peoples R China
Source
INFORMATION COMPUTING AND APPLICATIONS, PT 1 | 2012 / Vol. 307
Keywords
Web data extraction; XML Clustering; Pattern Discovery;
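DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract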
This paper presents an XML-based web data extraction method. The method translates a web page into an XML document, analyzes the document using XPath/XSLT, discovers data patterns and similarities in the page using an XML clustering algorithm, and constructs an XPath-based data extraction rule template. This approach improves the robustness and versatility of a web data extraction system. Experimental results show that the method achieves high precision and adapts to web pages from different sites and with different structures.
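The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sample page, the function names (`text_paths`, `discover_rules`, `extract`), and the `min_support` threshold are all hypothetical, the page is assumed to be already converted to well-formed XML, and grouping text nodes by identical tag path is a crude stand-in for the XML clustering step, whose details the abstract does not give.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical page, assumed to be already converted to well-formed XML.
PAGE = """
<html><body>
  <div class="item"><h2>Alpha</h2><span>10</span></div>
  <div class="item"><h2>Beta</h2><span>20</span></div>
  <p>footer</p>
</body></html>
"""

def text_paths(root):
    """Map each tag path (e.g. 'body/div/h2') to the texts found under it."""
    groups = defaultdict(list)

    def walk(node, path):
        text = (node.text or "").strip()
        if text:
            groups[path].append(text)
        for child in node:
            walk(child, f"{path}/{child.tag}")

    # Paths are relative to the root element, so they can be passed to findall().
    for child in root:
        walk(child, child.tag)
    return dict(groups)

def discover_rules(root, min_support=2):
    """Keep paths that repeat at least min_support times as rule templates.

    Simplification: identical tag paths are treated as one 'cluster'.
    """
    paths = text_paths(root)
    return sorted(p for p, texts in paths.items() if len(texts) >= min_support)

def extract(root, rules):
    """Apply each path rule using ElementTree's limited XPath support."""
    return {
        rule: [(el.text or "").strip() for el in root.findall(rule)]
        for rule in rules
    }

root = ET.fromstring(PAGE)
rules = discover_rules(root)
records = extract(root, rules)
print(rules)
print(records)
```

In this sketch the repeated `body/div/h2` and `body/div/span` paths are kept as rules, while the one-off `body/p` path is discarded as non-record content. A real system would need an HTML-to-XML cleaning step and a similarity measure that tolerates structural variation between pages.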
Pages: 708-715
Number of pages: 8