Dealing with zero word frequencies: A review of the existing rules of thumb and a suggestion for an evidence-based choice

Cited by: 38
Authors
Brysbaert, Marc [1]
Diependaele, Kevin [1]
Affiliation
[1] Univ Ghent, Dept Expt Psychol, B-9000 Ghent, Belgium
Keywords
Word frequency; Laplace transformation; Good-Turing algorithm; Zero frequency;
DOI
10.3758/s13428-012-0270-5
Chinese Library Classification (CLC) number
B841 [Research methods in psychology]
Discipline classification code
040201
Abstract
In a critical review of the heuristics used to deal with zero word frequencies, we show that four are suboptimal, one is good, and one may be acceptable. The four suboptimal strategies are discarding words with zero frequencies, giving words with zero frequencies a very low frequency, adding 1 to the frequency per million, and making use of the Good-Turing algorithm. The good algorithm is the Laplace transformation, which consists of adding 1 to each frequency count and increasing the total corpus size by the number of word types observed. A strategy that may be acceptable is to guess the frequency of absent words on the basis of other corpora and then to increase the total corpus size by the estimated summed frequency of the missing words. A comparison with the lexical decision times of the English Lexicon Project and the British Lexicon Project suggests that the Laplace transformation gives the most useful estimates (in addition to being easy to calculate). Therefore, we recommend it to researchers.
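The Laplace transformation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `laplace_fpm`, the dictionary input, and the conversion to frequency per million are assumptions made for the example. It also assumes that "the number of word types observed" means the number of distinct words in the corpus counts.

```python
def laplace_fpm(counts, words):
    """Laplace-smoothed frequency per million for each word in `words`.

    counts: dict mapping each word observed in the corpus to its raw count.
    words:  iterable of words to score; words absent from `counts` are
            treated as having a raw count of 0.

    Assumption (hypothetical helper, not from the paper's materials):
    the corpus size is increased by the number of observed word types.
    """
    n_tokens = sum(counts.values())   # total corpus size in tokens
    n_types = len(counts)             # number of distinct observed words
    denominator = n_tokens + n_types  # enlarged corpus size
    return {
        w: (counts.get(w, 0) + 1) / denominator * 1_000_000
        for w in words
    }


# Toy corpus: 10 tokens, 3 types, so the denominator is 13.
toy_counts = {"the": 6, "cat": 3, "sat": 1}
toy_scores = laplace_fpm(toy_counts, ["the", "sat", "dog"])
```

With this scheme an unseen word such as "dog" receives a small non-zero frequency, so it can be log-transformed and kept in the analysis instead of being discarded.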
Pages: 422-430
Page count: 9
Related papers
16 in total
[1]  
[Anonymous], J EXPT PSYCHOL HUMAN
[2]  
Baayen R. H., 2001, WORD FREQUENCY DISTR, V18
[3]   The English Lexicon Project [J].
Balota, David A. ;
Yap, Melvin J. ;
Cortese, Michael J. ;
Hutchison, Keith A. ;
Kessler, Brett ;
Loftis, Bjorn ;
Neely, James H. ;
Nelson, Douglas L. ;
Simpson, Greg B. ;
Treiman, Rebecca .
BEHAVIOR RESEARCH METHODS, 2007, 39 (03) :445-459
[4]   The Word Frequency Effect A Review of Recent Developments and Implications for the Choice of Frequency Estimates in German [J].
Brysbaert, Marc ;
Buchmeier, Matthias ;
Conrad, Markus ;
Jacobs, Arthur M. ;
Boelte, Jens ;
Boehl, Andrea .
EXPERIMENTAL PSYCHOLOGY, 2011, 58 (05) :412-424
[5]   Assessing the usefulness of Google Books' word frequencies for psycholinguistic research on word processing [J].
Brysbaert, Marc ;
Keuleers, Emmanuel ;
New, Boris .
FRONTIERS IN PSYCHOLOGY, 2011, 2
[6]   Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English [J].
Brysbaert, Marc ;
New, Boris .
BEHAVIOR RESEARCH METHODS, 2009, 41 (04) :977-990
[7]  
Gale W. A., 1995, Journal of Quantitative Linguistics, V2, P217, DOI 10.1080/09296179508590051
[8]  
Gulikers L., 1995, CELEX LEXICAL DATABA
[9]  
Jurafsky D., 2009, Speech and Language Processing
[10]   The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words [J].
Keuleers, Emmanuel ;
Lacey, Paula ;
Rastle, Kathleen ;
Brysbaert, Marc .
BEHAVIOR RESEARCH METHODS, 2012, 44 (01) :287-304