A fully automated object extraction system for the World Wide Web

Cited by: 61
Authors
Buttler, D [1]
Liu, L [1]
Pu, C [1]
Affiliations
[1] Georgia Inst Technol, Coll Comp, Atlanta, GA 30332 USA
Source
21ST INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, PROCEEDINGS | 2001
DOI
10.1109/ICDSC.2001.918966
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper presents Omini, a fully automated object extraction system. A distinctive feature of Omini is its suite of algorithms and automatically learned information extraction rules for discovering and extracting objects from dynamic or static Web pages that contain multiple object instances. We evaluated the system using more than 2,000 Web pages over 40 sites. It achieves 100% precision (it returns only correct objects) and excellent recall (between 93% and 98%, with very few significant objects left out). The object boundary identification algorithms are fast, taking about 0.1 second per page with a simple optimization.
Pages: 361-370
Number of pages: 10