SciCN: A Scientific Dataset for Chinese Named Entity Recognition

Cited by: 0
Authors
Yang, Jing [1]
Ji, Bin [1]
Li, Shasha [1]
Ma, Jun [1]
Yu, Jie [1]
Affiliations
[1] Natl Univ Def Technol, Coll Comp, Changsha 410073, Peoples R China
Source
CMC-COMPUTERS MATERIALS & CONTINUA | 2024 / Vol. 78 / Issue 03
Keywords
Named entity recognition; dataset; scientific information extraction; lexicon;
DOI
10.32604/cmc.2023.035594
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Named entity recognition (NER) is a fundamental task of information extraction (IE), and it has attracted considerable research attention in recent years. The abundance of annotated English NER datasets has significantly advanced NER research for English. By contrast, far less effort has gone into Chinese NER research, especially in the scientific domain, owing to the scarcity of Chinese NER datasets. To alleviate this problem, we present SciCN, a Chinese scientific NER dataset that contains entity annotations of titles and abstracts derived from 3,500 scientific papers. We manually annotate a total of 62,059 entities, classified into six types. Compared to English scientific NER datasets, SciCN is larger and more diverse: it contains more paper abstracts, and these abstracts are drawn from more research fields. To investigate the properties of SciCN and provide baselines for future research, we adapt a number of previous state-of-the-art Chinese NER models and evaluate them on SciCN. Experimental results show that SciCN is more challenging than other Chinese NER datasets. In addition, previous studies have shown that lexicons are effective for enhancing Chinese NER models. Motivated by this, we provide a scientific domain-specific lexicon. Validation results demonstrate that our lexicon delivers larger performance gains than lexicons from other domains. We hope that the SciCN dataset and the lexicon will make it possible to benchmark NER in the Chinese scientific domain and support future research. The dataset and lexicon are available at: https://github.com/yangjingla/SciCN.git.
Pages: 4303-4315
Number of pages: 13