On the Distribution of Deep Clausal Embeddings: A Large Cross-linguistic Study

Cited by: 0
Authors
Blasi, Damian E. [1, 2]
Cotterell, Ryan [3]
Wolf-Sonkin, Lawrence [4]
Stoll, Sabine [1]
Bickel, Balthasar [1]
Baroni, Marco [5, 6]
Affiliations
[1] Univ Zurich, Zurich, Switzerland
[2] Max Planck Inst Sci Human Hist, Jena, Germany
[3] Univ Cambridge, Cambridge, England
[4] Johns Hopkins Univ, Baltimore, MD 21218 USA
[5] Facebook AI Res, Menlo Pk, CA USA
[6] Catalan Inst Res & Adv Studies, Barcelona, Spain
Source
57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019) | 2019
Keywords
UNIVERSALS; MYTH;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract
Embedding a clause inside another ("the girl [who likes cars [that run fast]] has arrived") is a fundamental resource that has been argued to be a key driver of linguistic expressiveness. As such, it plays a central role in fundamental debates on what makes human language unique and how it might have evolved. Empirical evidence on the prevalence and the limits of embeddings has, however, been based on either laboratory setups or corpus data of relatively limited size. We introduce here a collection of large, dependency-parsed written corpora in 17 languages, which allow us, for the first time, to capture clausal embedding through dependency graphs and assess its distribution. Our results indicate that there is no evidence for hard constraints on embedding depth: the tail of depth distributions is heavy. Moreover, although deeply embedded clauses tend to be shorter, suggesting processing load issues, complex sentences with many embeddings do not display a bias towards less deep embeddings. Taken together, the results suggest that deep embeddings are not disfavored in written language. More generally, our study illustrates how resources and methods from latest-generation big-data NLP can provide new perspectives on fundamental questions in theoretical linguistics.
Pages: 3938-3943
Page count: 6