A MODEL FOR WORD CLUSTERING

Cited by: 0

Authors:

THOM, JA

ZOBEL, J

Affiliation:

[1] Department of Computer Science, Royal Melbourne Institute of Technology, Melbourne, 3001

Source:

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE | 1992 / Vol. 43 / No. 9

DOI:

10.1002/(SICI)1097-4571(199210)43:9<616::AID-ASI4>3.0.CO;2-A

Chinese Library Classification:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

It is common to model the distribution of words in text by measures such as the Poisson approximation. However, these measures ignore effects such as clustering: our analysis of document collections demonstrates that the Poisson approximation can significantly overestimate the probability that a document contains a word. Based on our analysis, we propose a new model for distribution of words in text, and show how this model can be used to estimate the probability that a document contains a word and the number of distinct words in a document.
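The overestimation described in the abstract can be illustrated with a minimal sketch. It assumes the usual form of the Poisson approximation, in which a word with `total_occurrences` occurrences in a collection of `total_words` words appears in a document of `doc_length` words with probability 1 - exp(-rate * doc_length). The function names and the toy collection are hypothetical, chosen only for illustration; this is not the model proposed in the paper, whose details are not given in this record.

```python
import math


def poisson_contains_prob(total_occurrences, total_words, doc_length):
    """Poisson estimate of the probability that a document contains a word.

    Assumes occurrences are spread independently over the collection, so the
    expected count in a document is rate * doc_length.
    """
    rate = total_occurrences / total_words  # expected occurrences per word position
    return 1.0 - math.exp(-rate * doc_length)


def observed_contains_fraction(docs, word):
    """Fraction of documents in which the word actually appears."""
    return sum(1 for doc in docs if word in doc) / len(docs)


# Hypothetical toy collection: 4 documents of 5 words each.
# All 4 occurrences of "zipf" are clustered in the first document.
docs = [
    ["zipf", "zipf", "zipf", "zipf", "law"],
    ["text", "index", "query", "rank", "term"],
    ["file", "page", "block", "disk", "cache"],
    ["word", "model", "count", "score", "list"],
]

total_words = sum(len(doc) for doc in docs)
total_occurrences = sum(doc.count("zipf") for doc in docs)

estimate = poisson_contains_prob(total_occurrences, total_words, doc_length=5)
observed = observed_contains_fraction(docs, "zipf")

print(f"Poisson estimate: {estimate:.3f}")
print(f"Observed fraction: {observed:.3f}")
```

In this toy case the clustered occurrences make the Poisson estimate noticeably higher than the observed fraction of documents containing the word, which is the kind of discrepancy the abstract reports for real collections.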

Pages: 616-627

Page count: 12
