Zipf's Law for Word Frequencies: Word Forms versus Lemmas in Long Texts

Cited by: 57
Authors
Corral, Alvaro [1, 2]
Boleda, Gemma [3]
Ferrer-i-Cancho, Ramon [4]
Affiliations
[1] Ctr Recerca Matemat, Barcelona, Spain
[2] Univ Autonoma Barcelona, E-08193 Barcelona, Spain
[3] Univ Pompeu Fabra, Dept Translat & Language Sci, Barcelona, Spain
[4] Univ Politecn Cataluna, Dept Ciencies Computacio, Complex & Quantitat Linguist Lab, Barcelona, Spain
Keywords
DISTRIBUTIONS; EVOLUTION;
DOI
10.1371/journal.pone.0129031
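Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09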
Abstract
Zipf's law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf's law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. We analyze several long literary texts comprising four languages, with different levels of morphological complexity. In all cases Zipf's law is fulfilled, in the sense that a power-law distribution of word or lemma frequencies is valid for several orders of magnitude. We investigate the extent to which the word-lemma transformation preserves two parameters of Zipf's law: the exponent and the low-frequency cut-off. We are not able to demonstrate a strict invariance of the tail, as for a few texts both exponents deviate significantly, but we conclude that the exponents are very similar, despite the remarkable transformation that going from words to lemmas represents, considerably affecting all ranges of frequencies. In contrast, the low-frequency cut-offs are less stable, tending to increase substantially after the transformation.
Pages: 23