SciCN: A Scientific Dataset for Chinese Named Entity Recognition

Cited by: 0
Authors
Yang, Jing [1]
Ji, Bin [1]
Li, Shasha [1]
Ma, Jun [1]
Yu, Jie [1]
Affiliations
[1] Natl Univ Def Technol, Coll Comp, Changsha 410073, Peoples R China
Source
CMC-COMPUTERS MATERIALS & CONTINUA | 2024 / Vol. 78 / No. 3
Keywords
Named entity recognition; dataset; scientific information extraction; lexicon
DOI
10.32604/cmc.2023.035594
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Named entity recognition (NER) is a fundamental task of information extraction (IE), and it has attracted considerable research attention in recent years. Abundant annotated English NER datasets have significantly advanced NER research for English. By contrast, far less effort has been devoted to Chinese NER research, especially in the scientific domain, owing to the scarcity of Chinese NER datasets. To alleviate this problem, we present SciCN, a Chinese scientific NER dataset that contains entity annotations of titles and abstracts derived from 3,500 scientific papers. We manually annotate a total of 62,059 entities, classified into six types. Compared to English scientific NER datasets, SciCN is larger in scale and more diverse: it contains more paper abstracts, and these abstracts are drawn from more research fields. To investigate the properties of SciCN and provide baselines for future research, we adapt a number of previous state-of-the-art Chinese NER models and evaluate them on SciCN. Experimental results show that SciCN is more challenging than other Chinese NER datasets. In addition, previous studies have demonstrated the effectiveness of using lexicons to enhance Chinese NER models. Motivated by this, we provide a scientific domain-specific lexicon. Validation results demonstrate that our lexicon delivers larger performance gains than lexicons from other domains. We hope that the SciCN dataset and the lexicon will make it possible to benchmark NER in the Chinese scientific domain and support progress in future research. The dataset and lexicon are available at: https://github.com/yangjingla/SciCN.git.
Pages: 4303-4315
Number of pages: 13