The meta book and size-dependent properties of written language

Cited by: 31
Authors
Bernhardsson, Sebastian [1]
da Rocha, Luis Enrique Correa [1]
Minnhagen, Petter [1]
Affiliations
[1] Umea Univ, Dept Phys, S-90187 Umea, Sweden
Keywords
DISTRIBUTIONS;
DOI
10.1088/1367-2630/11/12/123015
Chinese Library Classification
O4 [Physics];
Discipline classification code
0702;
Abstract
Evidence is presented for a systematic text-length dependence of the power-law index gamma of a single book. The estimated gamma values are consistent with a monotonic decrease from 2 to 1 with increasing text length. A direct connection to an extended Heaps' law is explored. As a consequence, the infinite-book limit is proposed to be given by gamma = 1 instead of the value gamma = 2 expected if Zipf's law is universally applicable. In addition, we explore the idea that the systematic text-length dependence can be described by a meta book concept, which is an abstract representation reflecting the word-frequency structure of a text. According to this concept, the word-frequency distribution of a text of a certain length written by a single author has the same characteristics as a text of the same length extracted from an imaginary complete infinite corpus written by the same author.
Pages: 15
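
The text-length dependence described in the abstract can be probed numerically on any tokenized text. The following is a minimal Python sketch, not the authors' procedure: the record does not say how gamma was estimated, so the estimator here is an assumption. It uses the common approximate maximum-likelihood formula for discrete power-law data, gamma ≈ 1 + n / Σ ln(k_i / (k_min − 1/2)), which is known to be rough for small k_min. The function names `vocabulary_growth`, `estimate_gamma`, and `gamma_by_length` are hypothetical, and gamma is assumed to be the exponent of the word-frequency distribution P(k) ~ k^(-gamma), where k is the number of occurrences of a word.

```python
import math
from collections import Counter


def vocabulary_growth(tokens):
    """Heaps-style curve: number of distinct words after each token."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def estimate_gamma(tokens, k_min=1):
    """Approximate MLE of gamma in P(k) ~ k**(-gamma).

    k is the occurrence count of a word. This is an assumed estimator,
    not necessarily the one used in the paper, and it is only a rough
    approximation when k_min is small.
    """
    counts = Counter(tokens)
    ks = [k for k in counts.values() if k >= k_min]
    if not ks:
        raise ValueError("no words with at least k_min occurrences")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in ks)
    return 1.0 + len(ks) / log_sum


def gamma_by_length(tokens, lengths, k_min=1):
    """Estimate gamma on prefixes of increasing length."""
    return {
        m: estimate_gamma(tokens[:m], k_min)
        for m in lengths
        if m <= len(tokens)
    }
```

Calling `gamma_by_length(tokens, [1000, 10000, 100000])` on the word list of a long book gives one estimate per prefix length, which can then be compared against the trend the abstract reports.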