EduNER: a Chinese named entity recognition dataset for education research

Cited by: 0
Authors
Xu Li
Chengkun Wei
Zhuoren Jiang
Wenlong Meng
Fan Ouyang
Zihui Zhang
Wenzhi Chen
Affiliations
[1] Zhejiang University, College of Computer Science and Technology
[2] Zhejiang University, School of Public Affairs
[3] Zhejiang University, College of Education
[4] Zhejiang University, Information Technology Center
Source
Neural Computing and Applications | 2023 / Vol. 35
Keywords
Chinese named entity recognition; Dataset; Benchmark; Education
DOI
Not available
Abstract
A high-quality domain-oriented dataset is crucial for the domain-specific named entity recognition (NER) task. In this study, we introduce a novel education-oriented Chinese NER dataset (EduNER). To provide representative and diverse training data, we collect data from multiple sources, including textbooks, academic papers, and education-related web pages. The collected documents span ten years (2012–2021). A team of domain experts is invited to accomplish the education NER schema definition, and a group of trained annotators is hired to complete the annotation. A collaborative labeling platform is built for accelerating human annotation. The constructed EduNER dataset includes 16 entity types, 11k+ sentences, and 35,731 entities. We conduct a thorough statistical analysis of EduNER and summarize its distinctive characteristics by comparing it with eight open-domain or domain-specific NER datasets. Sixteen state-of-the-art models are further utilized for NER tasks validation. The experimental results can enlighten further exploration. To the best of our knowledge, EduNER is the first publicly available dataset for NER task in the education domain, which may promote the development of education-oriented NER models.
Pages: 17717-17731
Number of pages: 14