Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations

Cited by: 5
Authors
Sanguinetti, Manuela [1]
Bosco, Cristina [2]
Cassidy, Lauren [3]
Cetinoglu, Ozlem [4]
Cignarella, Alessandra Teresa [2, 5]
Lynn, Teresa [3]
Rehbein, Ines [6]
Ruppenhofer, Josef [7]
Seddah, Djame [8]
Zeldes, Amir [9]
Affiliations
[1] Univ Cagliari, Dipartimento Matemat & Informat, Cagliari, Italy
[2] Univ Torino, Dipartimento Informat, Turin, Italy
[3] Dublin City Univ, ADAPT Ctr, Dublin 9, Ireland
[4] Univ Stuttgart, IMS, Stuttgart, Germany
[5] Univ Politecn Valencia, PRHLT Res Ctr, Valencia, Spain
[6] Univ Mannheim, Mannheim, Germany
[7] Leibniz Inst Deutsch Sprache, Mannheim, Germany
[8] INRIA, Paris, France
[9] Georgetown Univ, Washington, DC USA
Funding
Science Foundation Ireland
Keywords
Web; Social media; Treebanks; Universal Dependencies; Annotation guidelines; UGC; Corpus
DOI
10.1007/s10579-022-09581-9
Chinese Library Classification number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
This article presents a discussion on the main linguistic phenomena which cause difficulties in the analysis of user-generated texts found on the web and in social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework of syntactic analysis. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this article is twofold: (1) to provide a condensed, though comprehensive, overview of such treebanks, based on available literature, along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The overarching goal of this article is to provide a common framework for researchers interested in developing similar resources in UD, thus promoting cross-linguistic consistency, which is a principle that has always been central to the spirit of UD.
Pages: 493-544
Number of pages: 52