Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations

Cited by: 0
Authors
Manuela Sanguinetti
Cristina Bosco
Lauren Cassidy
Özlem Çetinoğlu
Alessandra Teresa Cignarella
Teresa Lynn
Ines Rehbein
Josef Ruppenhofer
Djamé Seddah
Amir Zeldes
Affiliations
[1] Università degli Studi di Cagliari, Dipartimento di Matematica e Informatica
[2] Università degli Studi di Torino, Dipartimento di Informatica
[3] Dublin City University, ADAPT Centre
[4] University of Stuttgart, IMS
[5] Universitat Politècnica de València, PRHLT Research Center
[6] University of Mannheim
[7] Leibniz-Institut für Deutsche Sprache
[8] INRIA
[9] Georgetown University
Source
Language Resources and Evaluation | 2023, Vol. 57
Keywords
Web; Social media; Treebanks; Universal Dependencies; Annotation guidelines; UGC
Abstract
This article presents a discussion on the main linguistic phenomena which cause difficulties in the analysis of user-generated texts found on the web and in social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework of syntactic analysis. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this article is twofold: (1) to provide a condensed, though comprehensive, overview of such treebanks—based on available literature—along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The overarching goal of this article is to provide a common framework for researchers interested in developing similar resources in UD, thus promoting cross-linguistic consistency, which is a principle that has always been central to the spirit of UD.
Pages: 493-544
Page count: 51