Searchable words on the Web

Cited by: 13
Authors
Williams, Hugh E. [1]
Zobel, Justin [1]
Affiliations
[1] RMIT Univ, Dept Comp Sci, GPOB 2476V, Melbourne, Vic 3001, Australia
Funding
Australian Research Council
Keywords
Web search; Terms; Word occurrences; Indexing
DOI
10.1007/s00799-003-0050-z
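Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work]
Discipline classification code
1205; 120501
Abstract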
In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 GB of World Wide Web documents and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large datasets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
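The vocabulary accumulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the two word definitions used here (a lenient one accepting any lowercase alphanumeric run, and a strict one accepting only alphabetic strings of at most 16 characters) and the names `words` and `vocabulary_growth` are hypothetical choices made for illustration.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def words(text: str, strict: bool = False) -> Iterator[str]:
    """Yield word occurrences under an assumed (hypothetical) word definition.

    Lenient: any maximal run of ASCII letters or digits, lowercased.
    Strict: lenient tokens that are purely alphabetic and at most 16 characters.
    """
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if strict and not (token.isalpha() and len(token) <= 16):
            continue
        yield token


def vocabulary_growth(
    documents: Iterable[str], strict: bool = False
) -> List[Tuple[int, int]]:
    """Return (word occurrences so far, distinct words so far) after each document."""
    seen = set()          # in-memory vocabulary accumulated so far
    occurrences = 0       # running count of word occurrences
    growth = []
    for doc in documents:
        for word in words(doc, strict):
            occurrences += 1
            seen.add(word)
        growth.append((occurrences, len(seen)))
    return growth


docs = ["the cat sat on the mat", "the dog sat on mp3 files"]
print(vocabulary_growth(docs))               # [(6, 5), (12, 8)]
print(vocabulary_growth(docs, strict=True))  # [(6, 5), (11, 7)]
```

On this toy input the strict definition drops only the token `mp3`, so the vocabulary shrinks by a single word.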
Pages: 99-105
Page count: 7