Research on Fine-Grained Entity Recognition of Ancient Book Based on Syntactic Features and Bert-BiLSTM-MHA-CRF

Cited by: 0
Authors
Wu, Shuai [1]
Yang, Xiuzhang [2,3]
He, Lin [1]
Gong, Zuoquan [4]
Affiliations
[1] College of Information Management, Nanjing Agricultural University, Nanjing
[2] Guizhou Big Data Academy, Guizhou University, Guiyang
[3] School of Cyber Science and Engineering, Wuhan University, Wuhan
[4] School of Information, Guizhou University of Finance and Economics, Guiyang
Keywords
Ancient Texts; Bert-BiLSTM-MHA-CRF; Syntactic Features; Named Entity Recognition; Pre-trained Model
DOI
10.11925/infotech.2096-3467.2023.1002
Abstract
[Objective] This study exploits the complex sentence structures of ancient texts to develop a more accurate method for identifying entity words, in support of digital humanities research. [Methods] Trigger words and relative words were used as key feature words for identifying entity words, and a sentence pattern template was designed around them. A Bert-BiLSTM-MHA-CRF model was constructed based on the characteristics of ancient texts. The syntactic features were then fused with the Bert-BiLSTM-MHA-CRF model to achieve deep, fine-grained entity recognition of ancient texts. [Results] The method achieves an F1 score of 0.88 on the conventionally annotated test set and 0.83 on the small-sample annotated test set. On the transfer learning test sets, it achieves 0.79 (The Book of Songs), 0.81 (Master Lü's Spring and Autumn Annals), and 0.85 (Discourses of the States). [Limitations] The syntactic feature templates were designed using only single ancient books. Semantic information mining does not take into account the structural features of characters in ancient texts, such as phonetic symbols and radicals. [Conclusions] The method also achieves accurate named entity recognition of ancient texts in the small-sample annotation and transfer learning experiments, providing high-quality corpus data for digital humanities research. © 2024 Chinese Academy of Sciences. All rights reserved.
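The F1 scores reported in the abstract can be illustrated with a minimal sketch of entity-level F1 over BIO-tagged sequences. The abstract does not state the paper's tagging scheme or evaluation protocol, so the BIO tags, the exact-span matching rule, and the names `extract_spans` and `entity_f1` are assumptions made here for illustration only.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); an assumed representation
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence such as B-PER, I-PER, O."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        # A tag continues the open span only if it is I- with the same type.
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1: a predicted entity counts only on an exact span and type match."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = extract_spans(g), extract_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: one sentence, two gold entities, one found by the model.
gold_tags = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_tags = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold_tags, pred_tags)
```

In this toy case precision is 1.0 and recall is 0.5, so the score is 2/3. Exact-span matching is stricter than per-character accuracy, which is why it is a common choice for fine-grained entity recognition.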
Pages: 136-148
Page count: 12