Rank Diversity of Languages: Generic Behavior in Computational Linguistics

Cited by: 23
Authors
Cocho, Germinal [1, 2]
Flores, Jorge [1]
Gershenson, Carlos [2, 3]
Pineda, Carlos [1]
Sanchez, Sergio [4]
Affiliations
[1] Univ Nacl Autonoma Mexico, Inst Fis, Mexico City 04510, DF, Mexico
[2] Univ Nacl Autonoma Mexico, Ctr Ciencias Complejidad, Mexico City 04510, DF, Mexico
[3] Univ Nacl Autonoma Mexico, Inst Invest Matemat Aplicadas & Sistemas, Mexico City 04510, DF, Mexico
[4] Univ Nacl Autonoma Mexico, Fac Ciencias, Mexico City 04510, DF, Mexico
Keywords
EVOLUTION; LAW; COMPLEX; DISTRIBUTIONS; CULTURE; WORDS;
DOI
10.1371/journal.pone.0121898
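Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code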
07 ; 0710 ; 09 ;
Abstract
Statistical studies of languages have focused on the rank-frequency distribution of words. Here we instead introduce a measure of how word ranks change over time, which we call rank diversity. We calculate this diversity for books published in six European languages since 1800, and find that it follows a universal lognormal distribution. Based on the mean and standard deviation associated with the lognormal distribution, we define three word regimes of languages: "heads" consist of words whose rank hardly changes over time, "bodies" are words of general use, and "tails" consist of context-specific words whose rank varies considerably over time. The heads and bodies reflect the size of the language cores identified by linguists for basic communication. We propose a Gaussian random walk model which reproduces the rank variation of words in time and thus the diversity. Rank diversity of words can be understood as the result of random variations in rank, where the size of the variation depends on the rank itself. We find that the core size is similar for all languages studied.
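The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that the rank diversity at rank k is the number of distinct words that occupy rank k over the observed period, divided by the number of time steps, and that the random walk perturbs each word's rank with Gaussian noise whose standard deviation is proportional to the rank. The function names, the `sigma` parameter, and the proportional step size are hypothetical choices made for illustration only.

```python
import random


def rank_diversity(rankings):
    """rankings[t][k] is the word holding rank k+1 at time step t.

    Returns a list d where d[k] is the number of distinct words seen at
    rank k+1 divided by the number of time steps (assumed definition).
    """
    n_times = len(rankings)
    n_ranks = min(len(r) for r in rankings)
    return [len({r[k] for r in rankings}) / n_times for k in range(n_ranks)]


def simulate_rankings(n_words=200, n_steps=50, sigma=0.1, seed=0):
    """Toy Gaussian random walk on ranks (illustrative assumption).

    At every step a word at rank k receives the score k + Normal(0, sigma*k),
    and words are then re-ranked by score, so the typical rank change grows
    with the rank itself.
    """
    rng = random.Random(seed)
    order = list(range(n_words))  # word ids, initially sorted by rank
    history = [order[:]]
    for _ in range(n_steps - 1):
        scored = [(k + rng.gauss(0.0, sigma * k), w)
                  for k, w in enumerate(order, start=1)]
        scored.sort()
        order = [w for _, w in scored]
        history.append(order[:])
    return history


if __name__ == "__main__":
    toy = [["the", "of", "cat"],
           ["the", "of", "dog"],
           ["the", "and", "cat"],
           ["the", "of", "sun"]]
    print(rank_diversity(toy))  # [0.25, 0.5, 0.75]

    d = rank_diversity(simulate_rankings())
    print(d[0], d[-1])  # low ranks are stable, high ranks change often
```

In the toy input the top rank is always held by the same word, so its diversity is the minimum possible (1 divided by the number of time steps), while the third rank is held by three different words in four steps.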
Pages: 12