Web Data Extraction with Hierarchical Clustering and Rich Features

Cited by: 0
Authors
Dong, Yongquan [1]
Zhao, Xiangjun [1]
Zhang, Gongjie [1]
Affiliations
[1] Xuzhou Normal Univ, Xuzhou, Jiangsu, Peoples R China
Source
RECENT TRENDS IN MATERIALS AND MECHANICAL ENGINEERING MATERIALS, MECHATRONICS AND AUTOMATION, PTS 1-3 | 2011 / Vol. 55-57
Keywords
Data Extraction; Hierarchical Clustering; Feature; Deep Web
DOI
10.4028/www.scientific.net/AMM.55-57.1003
Chinese Library Classification
TH [Machinery and Instrument Industry]
Discipline classification code
0802
Abstract
A novel approach is proposed to automatically extract data records from detail pages using hierarchical clustering techniques. The approach uses information from the listing pages to identify the content blocks in detail pages, which narrows the scope of Web data extraction. It also makes full use of structure and content features to cluster content feature vectors. Finally, it aligns the data elements of multiple detail pages to extract the data records. Experimental results on test beds of real Web pages show that the approach achieves high extraction accuracy and substantially outperforms existing techniques.
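The clustering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record does not specify the feature set, similarity measure, linkage, or threshold, so the `features`, `cosine`, and `cluster` functions, the tag-path and token-shape features, the average-link rule, and the 0.8 threshold below are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def features(tag_path, text):
    """Build one feature vector from structure (tag path) and content (token shape).

    Hypothetical feature set; the paper's actual features are not given here.
    """
    feats = Counter()
    for step in tag_path.split("/"):
        if step:
            feats["tag:" + step] += 1
    for token in text.split():
        if token.replace(".", "", 1).replace("$", "", 1).isdigit():
            feats["shape:number"] += 1
        elif token.endswith(":"):
            feats["shape:label"] += 1
        else:
            feats["shape:word"] += 1
    return feats


def cosine(a, b):
    """Cosine similarity between two sparse vectors (vector space model)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.8):
    """Average-link agglomerative clustering over vector indices.

    Merging stops once the best average similarity falls below `threshold`
    (an assumed value, not taken from the paper).
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x, y in combinations(range(len(clusters)), 2):
            sims = [cosine(vectors[i], vectors[j])
                    for i in clusters[x] for j in clusters[y]]
            score = sum(sims) / len(sims)
            if score > best:
                best, pair = score, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] += clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return [sorted(c) for c in clusters]


# Made-up elements from the content blocks of two detail pages.
elements = [
    ("html/body/div/h1", "Canon EOS 600D"),
    ("html/body/div/span", "$499.00"),
    ("html/body/div/h1", "Nikon D5100 Kit"),
    ("html/body/div/span", "$529.50"),
]
vectors = [features(path, text) for path, text in elements]
print(cluster(vectors))
```

In this toy input the two title elements share a tag path and token shape, as do the two price elements, so each pair ends up in its own cluster. Aligning such clusters across detail pages would then yield the fields of a data record.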
Pages: 1003-1008
Page count: 6