Zero-inflated beta distribution applied to word frequency and lexical dispersion in corpus linguistics

Cited by: 5
Authors
Burch, Brent [1]
Egbert, Jesse [2]
Affiliations
[1] No Arizona Univ, Dept Math & Stat, Flagstaff, AZ 86011 USA
[2] No Arizona Univ, Appl Linguist Program, Dept English, Flagstaff, AZ 86011 USA
Keywords
British National Corpus; mixture distribution; ranking words; word usage; zero-inflated beta distribution; PROPORTIONS; MODELS
DOI
10.1080/02664763.2019.1636941
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract
Corpus linguistics is the study of language as expressed in a body of texts or documents. The relative frequency of a word within a text and the dispersion of the word across the collection of texts provide information about the word's prominence and diffusion, respectively. In practice, people tend to use a relatively small number of the words in a language's inventory, and thus a large number of words in the lexicon are rarely employed. The zero-inflated beta distribution enables one to model the relative frequency of a word in a text, since some texts may not contain the word under study at all. In this paper, the expectations of a word's prominence and dispersion are defined under the zero-inflated beta model. Estimates of a word's prominence and dispersion are computed for words in the British National Corpus 1994 (BNC), a 100-million-word collection of written and spoken language covering a wide range of British English. The relationship between a word's prominence and dispersion is discussed, as well as measures that are functions of both prominence and dispersion.
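As a rough illustration of the model named in the abstract, the sketch below fits a zero-inflated beta distribution to a word's per-text relative frequencies and computes the mean of the resulting mixture. It is not the authors' estimation procedure: the method-of-moments fit, the function names `fit_zero_inflated_beta` and `zib_mean`, and the sample frequencies are all assumptions made for illustration, and the paper's own definitions of prominence and dispersion are not reproduced here.

```python
from statistics import mean, variance


def fit_zero_inflated_beta(rel_freqs):
    """Method-of-moments sketch (an assumption, not the paper's estimator).

    rel_freqs: relative frequency of one word in each text, each in [0, 1).
    Returns (pi0, alpha, beta), where pi0 is the share of texts in which
    the word does not occur and Beta(alpha, beta) models the nonzero values.
    """
    nonzero = [y for y in rel_freqs if y > 0]
    if len(nonzero) < 2:
        raise ValueError("need at least two texts containing the word")
    pi0 = 1 - len(nonzero) / len(rel_freqs)
    m = mean(nonzero)
    v = variance(nonzero)  # sample variance of the nonzero frequencies
    common = m * (1 - m) / v - 1
    if common <= 0:
        raise ValueError("variance too large for a beta fit by moments")
    return pi0, m * common, (1 - m) * common


def zib_mean(pi0, alpha, beta):
    """Mean of the mixture: point mass at 0 with weight pi0, beta otherwise."""
    return (1 - pi0) * alpha / (alpha + beta)


# Hypothetical relative frequencies of one word across six texts.
freqs = [0.0, 0.0, 0.01, 0.02, 0.03, 0.04]
pi0, alpha, beta = fit_zero_inflated_beta(freqs)
print(pi0, zib_mean(pi0, alpha, beta))
```

With a moment-based fit like this one, the mixture mean coincides with the plain average of the per-text relative frequencies, zeros included; the zero-inflation weight `pi0` separately records how many texts lack the word.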
Pages: 337-353
Page count: 17