Testing the Relationship between Word Length, Frequency, and Predictability Based on the German Reference Corpus

Cited by: 10
Authors
Koplenig, Alexander [1]
Kupietz, Marc [2]
Wolfer, Sascha [1]
Affiliations
[1] Leibniz Inst German Language IDS, Dept Lex Studies, D-68161 Mannheim, Germany
[2] Leibniz Inst German Language IDS, Dept Digital Linguist, Mannheim, Germany
Keywords
Compression; Corpus linguistics; Information theory; Large-scale corpora; N-gram modeling; Uniform information density
DOI
10.1111/cogs.13090
Chinese Library Classification number
B84 [Psychology]
Discipline classification code
04; 0402
Abstract
In a recent article, Meylan and Griffiths (Meylan & Griffiths, 2021, henceforth, M&G) focus their attention on the significant methodological challenges that can arise when using large-scale linguistic corpora. To this end, M&G revisit a well-known result of Piantadosi, Tily, and Gibson (2011, henceforth, PT&G) who argue that average information content is a better predictor of word length than word frequency. We applaud M&G who conducted a very important study that should be read by any researcher interested in working with large-scale corpora. The fact that M&G mostly failed to find clear evidence in favor of PT&G's main finding motivated us to test PT&G's idea on a subset of the largest archive of German language texts designed for linguistic research, the German Reference Corpus consisting of approximately 43 billion words. We only find very little support for the primary data point reported by PT&G.
Pages: 10