Towards a Balanced Named Entity Corpus for Dutch

Cited by: 0
Authors
Desmet, Bart [1, 2]
Hoste, Veronique [1, 2]
Affiliations
[1] Univ Coll Ghent, Language & Translat Technol Team, B-9000 Ghent, Belgium
[2] Univ Ghent, Dept Appl Math & Comp Sci, B-9000 Ghent, Belgium
Source
LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2010
Keywords
DOI
Not available
Chinese Library Classification
H [Language, Writing];
Discipline classification code
05 ;
Abstract
This paper introduces a new named entity corpus for Dutch. State-of-the-art named entity recognition systems require a substantial annotated corpus to be trained on. Such corpora exist for English, but not for Dutch. The STEVIN-funded SoNaR project aims to produce a diverse 500-million-word reference corpus of written Dutch, with four semantic annotation layers: named entities, coreference relations, semantic roles and spatiotemporal expressions. A 1-million-word subset will be manually corrected. Named entity annotation guidelines for Dutch were developed, adapted from the MUC and ACE guidelines. Adaptations include the annotation of products and events, the classification into subtypes, and the markup of metonymic usage. Inter-annotator agreement experiments were conducted to corroborate the reliability of the guidelines, which yielded satisfactory results (Kappa scores above 0.90). We are building a NER system, trained on the 1-million-word subcorpus, to automatically classify the remainder of the SoNaR corpus. To this end, experiments with various classification algorithms (MBL, SVM, CRF) and features have been carried out and evaluated.
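The abstract reports inter-annotator agreement as Kappa scores above 0.90 but does not say which variant was used. As a minimal sketch, assuming Cohen's kappa for two annotators who label the same tokens, the statistic can be computed as below. The function name `cohen_kappa` and the example labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: the paper only says "Kappa scores"; Cohen's two-annotator
    variant is used here purely for illustration.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label
    # proportions, summed over all labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level labels from two annotators.
annotator_a = ["PER", "PER", "LOC", "LOC"]
annotator_b = ["PER", "PER", "LOC", "PER"]
print(cohen_kappa(annotator_a, annotator_b))
```

In this toy example the annotators agree on three of four tokens (observed agreement 0.75) while chance agreement is 0.5, so kappa is 0.5, well below the level the abstract reports.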
Pages: 7
Related papers
50 in total
  • [31] Named entity recognition through corpus transformation and system combination
    Troyano, JA
    Carrillo, V
    Enríquez, F
    Galán, FJ
    [J]. ADVANCES IN NATURAL LANGUAGE PROCESSING, 2004, 3230 : 255 - 266
  • [32] Biomedical named entity extraction: some issues of corpus compatibilities
    Ekbal, Asif
    Saha, Sriparna
    Sikdar, Utpal Kumar
    [J]. SPRINGERPLUS, 2013, 2 : 1 - 12
  • [33] An Automatically Generated Annotated Corpus for Albanian Named Entity Recognition
    Hoxha, Klesti
    Baxhaku, Artur
    [J]. CYBERNETICS AND INFORMATION TECHNOLOGIES, 2018, 18 (01) : 95 - 108
  • [34] GENETAG: a tagged corpus for gene/protein named entity recognition
    Tanabe, L
    Xie, N
    Thom, LH
    Matten, W
    Wilbur, WJ
    [J]. BMC BIOINFORMATICS, 2005, 6 (Suppl 1)
  • [35] ChisNERE: a premodern Chinese corpus with named entity and relation annotation
    Tang, Xuemei
    Deng, Zekun
    Wang, Jun
    Su, Qi
    [J]. DIGITAL SCHOLARSHIP IN THE HUMANITIES, 2025,
  • [36] System evaluation on a named entity corpus from clinical notes
    Kipper-Schuler, Karin
    Kaggal, Vinod
    Masanz, James
    Ogren, Philip
    Savova, Guergana
    [J]. SIXTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, LREC 2008, 2008, : 3007 - 3011
  • [37] Corpus of Carbonate Platforms with Lexical Annotations for Named Entity Recognition
    Hu, Zhichen
    Ren, Huali
    Jiang, Jielin
    Cui, Yan
    Hu, Xiumian
    Xu, Xiaolong
    [J]. CMES-COMPUTER MODELING IN ENGINEERING & SCIENCES, 2023, 135 (01) : 91 - 108
  • [38] CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes
    Nastou, Katerina
    Koutrouli, Mikaela
    Pyysalo, Sampo
    Jensen, Lars Juhl
    [J]. BIOINFORMATICS ADVANCES, 2024, 4 (01):
  • [39] Named Entity Corpus Construction using Wikipedia and DBpedia Ontology
    Hahm, Younggyun
    Park, Jungyeul
    Lim, Kyungtae
    Kim, Youngsik
    Hwang, Dosam
    Choi, Key-Sun
    [J]. LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 2565 - 2569
  • [40] Dutch Parallel Corpus: A Balanced Copyright-Cleared Parallel Corpus
    Macken, Lieve
    De Clercq, Orphee
    Paulussen, Hans
    [J]. META, 2011, 56 (02) : 374 - 390