A MODEL FOR WORD CLUSTERING

Cited by: 0
Authors
THOM, JA
ZOBEL, J
Affiliation
[1] Department of Computer Science, Royal Melbourne Institute of Technology, Melbourne, 3001
Source
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE | 1992 / Vol. 43 / No. 9
DOI
10.1002/(SICI)1097-4571(199210)43:9<616::AID-ASI4>3.0.CO;2-A
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
It is common to model the distribution of words in text by measures such as the Poisson approximation. However, these measures ignore effects such as clustering: our analysis of document collections demonstrates that the Poisson approximation can significantly overestimate the probability that a document contains a word. Based on our analysis, we propose a new model for distribution of words in text, and show how this model can be used to estimate the probability that a document contains a word and the number of distinct words in a document.
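As a rough illustration of the overestimate described in the abstract, the sketch below compares a Poisson estimate of the probability that a document contains a word with the fraction of documents that actually contain it. It is not the model proposed in the paper. The function names, the equal-length-document assumption, and the toy counts are all hypothetical and chosen only for illustration.

```python
import math


def poisson_contains_prob(total_occurrences: int, num_docs: int) -> float:
    """Poisson estimate of P(a document contains the word).

    Assumes (hypothetically) equal-length documents, so the expected
    number of occurrences per document is total_occurrences / num_docs.
    """
    rate = total_occurrences / num_docs
    return 1.0 - math.exp(-rate)


def observed_contains_fraction(counts_per_doc: list[int]) -> float:
    """Fraction of documents in which the word occurs at least once."""
    containing = sum(1 for count in counts_per_doc if count > 0)
    return containing / len(counts_per_doc)


# Hypothetical toy collection: 10 occurrences clustered in 2 of 10 documents.
toy_counts = [6, 4, 0, 0, 0, 0, 0, 0, 0, 0]

estimate = poisson_contains_prob(sum(toy_counts), len(toy_counts))
observed = observed_contains_fraction(toy_counts)

print(f"Poisson estimate:  {estimate:.3f}")
print(f"Observed fraction: {observed:.3f}")
```

In this made-up collection the occurrences are concentrated in a few documents, so the Poisson estimate is well above the observed fraction, which is the direction of error the abstract reports.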
Pages: 616-627
Number of pages: 12