Analysis of lexical signatures for improving information persistence on the World Wide Web

Cited by: 16
Authors
Park, ST
Pennock, DM
Giles, CL
Krovetz, R
Affiliations
[1] Yahoo Res Labs, Pasadena, CA 91103 USA
[2] Penn State Univ, Sch Informat Sci & Technol, University Pk, PA 16802 USA
[3] Ask Jeeves Inc, Emeryville, CA 94608 USA
Keywords
algorithms; experimentation; measurement; performance; reliability; verification; broken URLs; dead links; digital libraries; indexing; inverse document frequency; information retrieval; lexical signatures; robust hyperlinks; search engines; term frequency; TREC; World Wide Web
DOI
10.1145/1028099.1028101
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A lexical signature (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures, including Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period, and on a TREC data set, evaluating their ability to both (1) uniquely identify the original (possibly modified) document, and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the Web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency-(TFIDF-) based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator called Test & Select (TS) to mitigate LS conflict. TS outperforms all eight static methods in terms of both extracting the desired document and finding relevant information, over three different search engines. All LS methods show significant performance degradation as documents in the corpus are edited.
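The three basic static selection rules named in the abstract (TF, DF, TFIDF) can be illustrated with a minimal sketch. It assumes a toy in-memory corpus, a naive tokenizer, no stop-word removal, and a plain `tf * log(N / df)` weighting; the function names, the default signature length `k`, and the tie-breaking rule are hypothetical choices for illustration and are not taken from the paper, whose exact term-weighting and selection details are not given in this record.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (naive assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(doc, corpus, method="tfidf", k=5):
    """Pick k terms from `doc` as a lexical signature.

    method="tf":    highest term frequency within the document
    method="df":    lowest document frequency across the corpus
    method="tfidf": highest tf * log(N / df)
    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    if method not in ("tf", "df", "tfidf"):
        raise ValueError(f"unknown method: {method}")

    tf = Counter(tokenize(doc))
    doc_term_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)
    # Document frequency: number of corpus documents containing each term.
    df = {t: sum(t in terms for terms in doc_term_sets) for t in tf}

    def score(term):
        if method == "tf":
            return tf[term]
        if method == "df":
            return -df[term]  # rarer terms score higher
        return tf[term] * math.log(n_docs / max(df[term], 1))

    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:k]


doc = "web pages move and web links break but lexical signatures find moved web pages"
corpus = [
    doc,
    "web search engines rank web pages",
    "dead links break on the web",
]

for method in ("tf", "df", "tfidf"):
    print(method, lexical_signature(doc, corpus, method=method, k=3))
```

In this toy corpus the TF rule favors "web", which occurs in every document and so discriminates poorly, while the DF and TFIDF rules avoid it. That loosely mirrors the trade-off the abstract describes between finding relevant documents and uniquely identifying one.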
Pages: 540-572
Page count: 33