Thai Nested Named Entity Recognition Corpus

Citations: 0
Authors
Buaphet, Weerayut [1]
Udomcharoenchaikit, Can [1]
Limkonchotiwat, Peerat [1]
Rutherford, Attapol T. [2]
Nutanong, Sarana [1]
Affiliations
[1] VISTEC, School of Information Science and Technology, Pa Yup Nai, Thailand
[2] Chulalongkorn University, Department of Linguistics, Bangkok, Thailand
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022) | 2022
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents the first Thai Nested Named Entity Recognition (N-NER) dataset. The dataset comprises 264,798 mentions across 104 classes, with a maximum nesting depth of 8 layers, drawn from 4,894 documents of news articles and restaurant reviews. To the best of our knowledge, it is the largest non-English N-NER dataset and the first non-English one with fine-grained classes. To understand the new challenges the dataset brings to the field, we conduct an experimental study on (i) cutting-edge N-NER models with state-of-the-art accuracy in English and (ii) baseline methods based on well-known language model architectures. The experimental results yield two key findings. First, all models produce poor F1 scores in the tail region of the class distribution. Second, on our Thai dataset these models provide little or no performance improvement over the baseline methods. These findings suggest that further investigation is required to build a multilingual N-NER solution that works well across different languages. The dataset and code are available at: github.com/vistec-AI/Thai-NNER.git
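The two quantities the abstract leans on, nesting depth and per-class F1, can be illustrated with a minimal Python sketch. It assumes mentions are stored as (start, end, label) tuples with an exclusive end offset and that spans never cross; this span format, the function names, and the example labels are hypothetical and are not taken from the Thai-NNER repository.

```python
def max_nesting_depth(spans):
    """Deepest level of nesting, counting an outermost mention as depth 1.

    Assumes spans are properly nested (no crossing boundaries).
    """
    # Outer spans first: start ascending, then end descending.
    ordered = sorted(spans, key=lambda s: (s[0], -s[1]))
    open_ends = []  # end offsets of the mentions that enclose the current one
    deepest = 0
    for start, end, _label in ordered:
        # Close every mention that finished before this one begins.
        while open_ends and open_ends[-1] <= start:
            open_ends.pop()
        open_ends.append(end)
        deepest = max(deepest, len(open_ends))
    return deepest


def per_class_f1(gold, pred):
    """Exact-match F1 per label; a class with no correct prediction scores 0.0."""
    scores = {}
    for label in {s[2] for s in gold | pred}:
        g = {s for s in gold if s[2] == label}
        p = {s for s in pred if s[2] == label}
        tp = len(g & p)
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# Hypothetical example: a person mention containing two finer-grained parts.
gold = {(0, 5, "person"), (0, 2, "firstname"), (3, 5, "lastname"), (6, 9, "org")}
pred = {(0, 5, "person"), (0, 2, "firstname"), (6, 8, "org")}

depth = max_nesting_depth(gold)
scores = per_class_f1(gold, pred)
```

Reporting F1 per class rather than one pooled score is what makes weak performance on rare (tail) classes visible; a micro-averaged score would be dominated by the frequent classes.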
Pages: 1473-1486
Number of pages: 14