The exact rank-frequency function and size-frequency function of N-grams and N-word phrases with applications

Cited by: 2
Authors
Egghe, L
Affiliations
[1] Limburgs Univ Ctr, B-3590 Diepenbeek, Belgium
[2] Univ Antwerp, B-2610 Antwerp, Belgium
Keywords
N-gram; N-word phrase; rank-frequency distribution; size-frequency distribution; Zipfian distribution
DOI
10.1016/j.mcm.2003.12.016
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
N-grams are generalized words consisting of N consecutive symbols (letters), as they are used in a text. N-word phrases are general concepts consisting of N consecutive words, also as used in a text. Given that the rank-frequency function of single letters (i.e., one-grams) or of single words (i.e., one-word phrases) is Zipfian, we determine in this paper the exact rank-frequency function (i.e., the occurrence of N-grams or N-word phrases on each rank) and size-frequency function (i.e., the density of N-grams or N-word phrases on each occurrence density) of these N-grams and N-word phrases. This paper distinguishes itself from others on this topic by allowing no approximations in the calculations. This leads to an intricate rank-frequency function for N-grams and N-word phrases (as we knew before from unpublished calculations) but, surprisingly, to a very simple size-frequency function $f_N$ for N-grams or N-word phrases of the form $f_N(j) = \frac{F}{j^{1+1/\beta}} \ln^{N-1}\left(\frac{G}{j}\right)$, where the Zipfian distribution of single letters or words is proportional to $1/r^{\beta}$. The paper closes with the calculation of type/token averages $\mu_N$ and type/token-taken averages $\mu_N^*$ for N-grams and N-word phrases, where we verify the theoretically proved result $\mu_N^* \geq \mu_N$ and also give estimates for the differences $\mu_N^* - \mu_N$. (c) 2005 Elsevier Ltd. All rights reserved.
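As an illustration of the two quantities named in the abstract, the following minimal Python sketch counts N-grams empirically in a toy string. It is not the paper's exact derivation: the function names `ngrams`, `rank_frequency` and `size_frequency` and the sample text are assumptions introduced here for illustration only.

```python
from collections import Counter


def ngrams(symbols, n):
    """Return every run of n consecutive symbols (letters or words)."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


def rank_frequency(symbols, n):
    """Occurrence counts of the distinct N-grams; index 0 holds rank 1."""
    counts = Counter(ngrams(symbols, n))
    return sorted(counts.values(), reverse=True)


def size_frequency(symbols, n):
    """Map j -> number of distinct N-grams that occur exactly j times."""
    counts = Counter(ngrams(symbols, n))
    return dict(Counter(counts.values()))


# Toy input (hypothetical, not data from the paper).
text = "abracadabra"
print(rank_frequency(text, 2))  # [2, 2, 2, 1, 1, 1, 1]
print(size_frequency(text, 2))  # {2: 3, 1: 4}
```

Passing a list of words instead of a string (for example `"to be or not to be".split()`) gives the corresponding counts for N-word phrases.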
Pages: 807-823
Number of pages: 17