Named Entity Recognition of Traditional Chinese Medicine Classics Based on SiKuBERT and Multivariate Data Embedding

Cited by: 0
Authors
Zhang, Wendong [1]
Wu, Ziwei [1]
Song, Guochang [1]
Huo, Qingao [1]
Wang, Bo [1]
Affiliations
[1] College of Software, Xinjiang University, Xinjiang, Urumqi
Source
Huanan Ligong Daxue Xuebao/Journal of South China University of Technology (Natural Science) | 2024, Vol. 52, No. 06
Keywords
Compendium of Materia Medica; multivariate data embedding; named entity recognition; SiKuBERT; traditional Chinese medicine classics
DOI
10.12141/j.issn.1000-565X.230143
Abstract
Named entity recognition (NER) for traditional Chinese medicine (TCM) classics is the basis for constructing TCM knowledge graphs and is of great significance for the extraction and intelligent presentation of TCM knowledge. However, the TCM knowledge system is large, and publicly available corpora are scarce and semantically complex. Most current research focuses on character vector representations and does not fully consider the rich semantic features carried by the structure of Chinese characters. Moreover, because Chinese characters are semantically rich, latent features remain insufficiently expressed and polysemy remains a problem. This paper proposes a named entity recognition method based on SiKuBERT and multivariate data embedding, combining the corpus features of ancient Chinese medicine books with the structural information of ancient Chinese characters. In this method, word feature information is created by SiKuBERT; on this basis, word features and radical features are embedded to capture the semantic information of Chinese characters, so that characters with similar radical sequences lie close to each other in the vector space. The method is then used to identify the names of people, herbal medicines, diseases, pathologies, and meridians in the Materia Medica dataset. The experimental results show that the proposed method effectively extracts the five types of entities in the text, with an F1 score of 86.66%, a precision of 86.95%, and a recall of 86.37%. Compared with the SiKuBERT-CRF model based on word features, the proposed method integrates word information with the structural information of traditional Chinese characters, which improves entity recognition; the overall F1 score is improved by 2.83 percentage points. Moreover, the proposed method is most effective in recognizing Chinese herbal medicine names and disease names with significant radicals, whose F1 scores improve by 3.48 and 0.97 percentage points, respectively, compared with the SiKuBERT-CRF model based on word features. In general, the performance of the proposed method is higher than that of other mainstream deep learning models, and it has good generalization ability. © 2024 South China University of Technology. All rights reserved.
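As a quick consistency check on the figures reported in the abstract, the F1 score can be recomputed from the stated precision and recall. The sketch below assumes the three overall figures are aggregated in the same way (for example, micro-averaged over all entities), which the abstract does not state; the function name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall values reported in the abstract, in percent.
reported_precision = 86.95
reported_recall = 86.37
reported_f1 = 86.66

# Assumption: all three figures use the same averaging scheme.
computed_f1 = f1_score(reported_precision, reported_recall)
print(f"computed F1 = {computed_f1:.2f}%")
```

Under that assumption the recomputed value agrees with the reported 86.66% to two decimal places.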
Pages: 128-137
Number of pages: 9