ViMedNER: A Medical Named Entity Recognition Dataset for Vietnamese

Cited by: 0
Authors
Duong, Pham Van [1, 2]
Trinh, Tien-Dat [2]
Nguyen, Minh-Tien [3]
Vu, Huy-The [3]
Pham, Minh-Chuan [3]
Tuan, Tran Manh [4]
Son, Le Hoang [5, 6]
Affiliations
[1] School of Information Communication and Technology, Hanoi University of Science and Technology, Hanoi
[2] ICT Department, FPT University, Hanoi
[3] Faculty of Information Technology, Hung Yen University of Technology and Education, Hung Yen
[4] Faculty of Computer Science and Engineering, Thuyloi University, Hanoi
[5] VNU Information Technology Institute, Vietnam National University, Hanoi
[6] VNU University of Science, Vietnam National University, Hanoi
Keywords
Medical text; Named entity recognition; Pre-trained language model; Vietnamese corpus
DOI
10.4108/eetinis.v11i3.5221
Abstract
Named entity recognition (NER) is one of the most important tasks in natural language processing: it identifies entity boundaries and classifies the entities into pre-defined categories. NER systems have been developed for various languages, but little work has been done for Vietnamese. This is mainly due to the scarcity of high-quality annotated data, especially for specific domains such as medicine and healthcare. In this paper, we introduce a new medical NER dataset, named ViMedNER, for recognizing Vietnamese medical entities. Unlike existing works designed for common or overly specific entities, we focus on entity types that can be used in common diagnostic and treatment scenarios: disease names, disease symptoms, disease causes, diagnostics, and treatments. These entities help doctors diagnose and treat common diseases. Our dataset is collected from four well-known Vietnamese websites that specialize in drug sales and disease diagnostics, and is annotated by domain experts with high agreement scores. To create benchmark results, strong NER baselines based on pre-trained language models, including PhoBERT, XLM-R, ViDeBERTa, ViPubMedDeBERTa, and ViHealthBERT, are implemented and evaluated on the dataset. Experimental results show that XLM-R consistently outperforms the other pre-trained language models. Furthermore, additional experiments are conducted to explore the behavior of the baselines and the characteristics of our dataset. © 2023 European Alliance for Innovation. All Rights Reserved.