The Entropy of Words: Learnability and Expressivity across More than 1000 Languages

Cited by: 67
Authors
Bentz, Christian [1 ,2 ]
Alikaniotis, Dimitrios [3 ]
Cysouw, Michael [4 ]
Ferrer-i-Cancho, Ramon [5 ]
Affiliations
[1] Univ Tubingen, DFG Ctr Adv Studies, Rumelinstr 23, D-72070 Tubingen, Germany
[2] Univ Tubingen, Dept Gen Linguist, Wilhelmstr 19-23, D-72074 Tubingen, Germany
[3] Univ Cambridge, Dept Theoret & Appl Linguist, Cambridge CB3 9DP, England
[4] Philipps Univ Marburg, Forschungszentrum Deutsch Sprachatlas, Pilgrimstein 16, D-35032 Marburg, Germany
[5] Univ Politecn Cataluna, Dept Ciencies Comp, LARCA Res Grp, Complex & Quantitat Linguist Lab, ES-08034 Barcelona, Catalonia, Spain
Keywords
natural language entropy; entropy rate; unigram entropy; quantitative language typology; compression
DOI
10.3390/e19060275
Chinese Library Classification (CLC): O4 [Physics]
Discipline code: 0702
Abstract
The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics and language sciences more generally. Information theory gives us tools at hand to measure precisely the average amount of choice associated with words: the word entropy. Here, we use three parallel corpora, encompassing ca. 450 million words in 1916 texts and 1259 languages, to tackle some of the major conceptual and practical problems of word entropy estimation: dependence on text size, register, style and estimation method, as well as non-independence of words in co-text. We present two main findings: Firstly, word entropies display relatively narrow, unimodal distributions. There is no language in our sample with a unigram entropy of less than six bits/word. We argue that this is in line with information-theoretic models of communication. Languages are held in a narrow range by two fundamental pressures: word learnability and word expressivity, with a potential bias towards expressivity. Secondly, there is a strong linear relationship between unigram entropies and entropy rates. The entropy difference between words with and without co-textual information is narrowly distributed around ca. three bits/word. In other words, knowing the preceding text reduces the uncertainty of words by roughly the same amount across languages of the world.
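As a rough illustration of the quantity discussed in the abstract, the Python sketch below computes a plug-in (maximum-likelihood) estimate of unigram word entropy, H = -sum over word types of p(w) * log2 p(w), from naively whitespace-tokenized text. This is only a minimal sketch under those assumptions; the paper itself compares several estimation methods, controls for text size and register, and additionally estimates entropy rates, none of which is reproduced here.

import math
from collections import Counter

def unigram_entropy(text: str) -> float:
    """Plug-in (maximum-likelihood) estimate of unigram word entropy in bits/word."""
    tokens = text.lower().split()      # naive whitespace tokenization, for illustration only
    counts = Counter(tokens)           # word-type frequencies
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy usage: texts with larger, more evenly used vocabularies yield higher entropy.
print(unigram_entropy("the cat sat on the mat"))

On realistic corpora such a plug-in estimate is biased downward for small samples, which is one reason the paper evaluates the dependence of word entropy on text size and on the choice of estimator.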
Pages: 32