EduNER: a Chinese named entity recognition dataset for education research

Cited by: 7
Authors
Li, Xu [1]
Wei, Chengkun [1]
Jiang, Zhuoren [2]
Meng, Wenlong [1]
Ouyang, Fan [3]
Zhang, Zihui [4]
Chen, Wenzhi [1]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, 38 Zheda Rd, Hangzhou 310027, Zhejiang, Peoples R China
[2] Zhejiang Univ, Sch Publ Affairs, 866 Yuhangtang Rd, Hangzhou 310058, Zhejiang, Peoples R China
[3] Zhejiang Univ, Coll Educ, 866 Yuhangtang Rd, Hangzhou 310058, Zhejiang, Peoples R China
[4] Zhejiang Univ, Informat Technol Ctr, 866 Yuhangtang Rd, Hangzhou 310058, Zhejiang, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Chinese named entity recognition; Dataset; Benchmark; Education; Agreement
DOI
10.1007/s00521-023-08635-5
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A high-quality, domain-oriented dataset is crucial for domain-specific named entity recognition (NER). In this study, we introduce EduNER, a novel education-oriented Chinese NER dataset. To provide representative and diverse training data, we collect documents from multiple sources, including textbooks, academic papers, and education-related web pages. The collected documents span ten years (2012-2021). A team of domain experts defines the education NER schema, and a group of trained annotators completes the annotation. A collaborative labeling platform is built to accelerate human annotation. The resulting EduNER dataset includes 16 entity types, 11k+ sentences, and 35,731 entities. We conduct a thorough statistical analysis of EduNER and summarize its distinctive characteristics by comparing it with eight open-domain or domain-specific NER datasets. We further validate the dataset by evaluating sixteen state-of-the-art NER models on it, and the experimental results may inform further exploration. To the best of our knowledge, EduNER is the first publicly available NER dataset in the education domain, and it may promote the development of education-oriented NER models.
Pages: 17717-17731
Page count: 15