A study of BERT-based methods for formal citation identification of scientific data

Cited by: 2
Authors
Yang, Ning [1, 2, 3]
Zhang, Zhiqiang [1, 2, 3]
Huang, Feihu [4]
Affiliations
[1] Chinese Acad Sci, Chengdu Lib, Chengdu 610041, Peoples R China
[2] Chinese Acad Sci, Informat Ctr, Chengdu 610041, Peoples R China
[3] Univ Chinese Acad Sci, Sch Econ & Management, Dept Lib Informat & Arch Management, Beijing 100190, Peoples R China
[4] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
Keywords
BERT; Research data; Formal data citation; Identification methods
DOI
10.1007/s11192-023-04833-z
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Studying scientific data citation is crucial for promoting data sharing and underpins the measurement and analysis of scientific data. To this end, data reference information must be identified and labeled. Many supervised methods currently exist for entity recognition and relation extraction of diseases, drugs, proteins, symptoms, and so on, but their effectiveness for scientific data recognition has not been examined. To fill this gap, this study examines the effectiveness of classical machine learning models and deep learning models in recognizing scientific data citations. In the experiments, the full text of scientific and technical papers served as the research object, and their references were classified and annotated using rules and manual recognition to form a dataset. The empirical results show that: (1) the methods used in this paper can automatically identify and extract data citations and can address the problem of automating the construction of citation relationships between scientific and technical literature and scientific data; (2) the BERT-based models are the most effective in the scientific data citation recognition task, especially BioBERT and SciBERT; (3) full-text information has a crucial impact on the recognition results.
Pages: 5865-5881
Page count: 17