A web page segmentation algorithm for extracting product information

Cited by: 2
Authors
Wu, Changjun [1]
Zeng, Guosun [1]
Xu, Guorong [1]
Affiliations
[1] Tongji Univ, Dept Comp Sci & Engn, Shanghai 201804, Peoples R China
Source
2006 IEEE INTERNATIONAL CONFERENCE ON INFORMATION ACQUISITION, VOLS 1 AND 2, CONFERENCE PROCEEDINGS | 2006
Funding
National Natural Science Foundation of China;
Keywords
information retrieval; search engine; product block; page segmentation;
DOI
10.1109/ICIA.2006.305954
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the rapid development of the Internet, the web has become the most popular and the largest resource from which people acquire information. Search engines play an important role in retrieving that information. However, the smallest processing unit of a search engine is the whole web page, which contains a great deal of noisy information. If the useful information could be extracted and used as the smallest processing unit instead, search engine precision would improve; this is the motivation for page segmentation algorithms. Traditional algorithms, however, cannot extract blocks at the product level. Hence, a novel algorithm based on product features and the DOM (Document Object Model) is proposed. Compared with traditional algorithms, the proposed page segmentation algorithm greatly improves information consistency and also reduces complexity.
Pages: 1374-1379
Number of pages: 6