Towards a Balanced Named Entity Corpus for Dutch

Cited by: 0
Authors
Desmet, Bart [1,2]
Hoste, Veronique [1,2]
Affiliations
[1] Univ Coll Ghent, Language & Translat Technol Team, B-9000 Ghent, Belgium
[2] Univ Ghent, Dept Appl Math & Comp Sci, B-9000 Ghent, Belgium
Source
LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2010
Keywords
DOI
Not available
Chinese Library Classification
H [Language and Writing];
Subject classification code
05;
Abstract
This paper introduces a new named entity corpus for Dutch. State-of-the-art named entity recognition systems require a substantial annotated corpus to be trained on. Such corpora exist for English, but not for Dutch. The STEVIN-funded SoNaR project aims to produce a diverse 500-million-word reference corpus of written Dutch, with four semantic annotation layers: named entities, coreference relations, semantic roles and spatiotemporal expressions. A 1-million-word subset will be manually corrected. Named entity annotation guidelines for Dutch were developed, adapted from the MUC and ACE guidelines. Adaptations include the annotation of products and events, the classification into subtypes, and the markup of metonymic usage. Inter-annotator agreement experiments were conducted to corroborate the reliability of the guidelines, which yielded satisfactory results (Kappa scores above 0.90). We are building a NER system, trained on the 1-million-word subcorpus, to automatically classify the remainder of the SoNaR corpus. To this end, experiments with various classification algorithms (MBL, SVM, CRF) and features have been carried out and evaluated.
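The abstract reports inter-annotator agreement as Kappa scores above 0.90 but does not say which variant was used. As a minimal sketch, assuming Cohen's kappa for two annotators who each assign one label per token, the statistic could be computed as below. The function name `cohen_kappa` and the toy label sequences are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: one categorical label per item from each annotator.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over all labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical toy annotations (PER = person, LOC = location).
annotator_a = ["PER", "PER", "LOC", "LOC"]
annotator_b = ["PER", "PER", "LOC", "PER"]
print(cohen_kappa(annotator_a, annotator_b))  # prints 0.5
```

In this toy case the annotators agree on 3 of 4 items (observed agreement 0.75) while chance agreement is 0.5, so kappa is 0.5. A score above 0.90, as reported in the abstract, means agreement far beyond what chance would produce.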
Pages: 7