Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations

Cited by: 5
Authors
Sanguinetti, Manuela [1]
Bosco, Cristina [2]
Cassidy, Lauren [3]
Cetinoglu, Ozlem [4]
Cignarella, Alessandra Teresa [2, 5]
Lynn, Teresa [3]
Rehbein, Ines [6]
Ruppenhofer, Josef [7]
Seddah, Djame [8]
Zeldes, Amir [9]
Affiliations
[1] Univ Cagliari, Dipartimento Matemat & Informat, Cagliari, Italy
[2] Univ Torino, Dipartimento Informat, Turin, Italy
[3] Dublin City Univ, ADAPT Ctr, Dublin 9, Ireland
[4] Univ Stuttgart, IMS, Stuttgart, Germany
[5] Univ Politecn Valencia, PRHLT Res Ctr, Valencia, Spain
[6] Univ Mannheim, Mannheim, Germany
[7] Leibniz Inst Deutsch Sprache, Mannheim, Germany
[8] INRIA, Paris, France
[9] Georgetown Univ, Washington, DC USA
Funding
Science Foundation Ireland
Keywords
Web; Social media; Treebanks; Universal Dependencies; Annotation guidelines; UGC; CORPUS
DOI
10.1007/s10579-022-09581-9
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
This article presents a discussion on the main linguistic phenomena which cause difficulties in the analysis of user-generated texts found on the web and in social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework of syntactic analysis. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this article is twofold: (1) to provide a condensed, though comprehensive, overview of such treebanks, based on available literature, along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The overarching goal of this article is to provide a common framework for researchers interested in developing similar resources in UD, thus promoting cross-linguistic consistency, which is a principle that has always been central to the spirit of UD.
Pages: 493-544
Page count: 52