Wrapper verification

Cited by: 49
Author
Kushmerick N. [1]
Affiliation
[1] Department of Computer Science, University College Dublin
Keywords
Country Code; Relational Data Model; Semistructured Data; Word Count; Word Length;
DOI
10.1023/A:1019229612909
Abstract
Many Internet information-management applications (e.g., information integration systems) require a library of wrappers, specialized information extraction procedures that translate a source's native format into a structured representation suitable for further application-specific processing. Maintaining wrappers is tedious and error-prone, because the formatting regularities on which wrappers rely change frequently on the decentralized and dynamic Internet. The wrapper verification problem is to determine whether a wrapper is operating correctly. Standard regression testing approaches are inappropriate, because both the formatting regularities on which wrappers rely and the source's underlying content may change. We introduce RAPTURE, a fully-implemented, domain-independent wrapper verification algorithm. RAPTURE computes a probabilistic similarity measure between a wrapper's expected and observed output, where similarity is defined in terms of simple numeric features (e.g., the length, or the fraction of punctuation characters) of the extracted strings. Experiments with numerous actual Internet sources demonstrate that RAPTURE performs substantially better than standard regression testing. © 2000, Kluwer Academic Publishers.
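The feature-based comparison described in the abstract can be illustrated with a rough sketch. This is not RAPTURE itself: the abstract does not give the algorithm's details, so the normal model, the two features chosen, and the function names `features`, `fit`, and `similarity` are all assumptions made for illustration. The idea shown is only that a wrapper's output can be checked against statistics of previously correct output rather than against exact expected strings.

```python
import math
import string


def features(s):
    """Simple numeric features of one extracted string (illustrative choice)."""
    n = len(s)
    punct = sum(ch in string.punctuation for ch in s)
    return {"length": float(n), "punct_fraction": punct / n if n else 0.0}


def fit(examples):
    """Estimate per-feature mean and standard deviation from known-good strings."""
    rows = [features(s) for s in examples]
    model = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        # Floor the deviation so a constant feature does not divide by zero.
        model[name] = (mean, max(math.sqrt(var), 1e-6))
    return model


def similarity(model, observed):
    """Score in [0, 1]; higher means the observed strings look like the known-good ones.

    Assumption: each feature is treated as independent and normally distributed,
    and the score is the product of two-sided tail probabilities.
    """
    rows = [features(s) for s in observed]
    score = 1.0
    for name, (mean, std) in model.items():
        avg = sum(r[name] for r in rows) / len(rows)
        z = abs(avg - mean) / std
        score *= math.erfc(z / math.sqrt(2))
    return score
```

Under these assumptions, a model fitted on country names such as "Ireland" and "France" should score new country names far higher than strings that still contain HTML tags, even though the content itself has changed. Exact-match regression testing cannot make that distinction.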
Pages: 79-94
Page count: 15