Recognition of the agricultural named entities with multi-feature fusion based on BERT

Cited by: 0
Authors
Zhao P. [1]
Zhao C. [1, 2]
Wu H. [2, 3, 4]
Wang W. [2, 3]
Affiliations
[1] School of Engineering, Shanxi Agricultural University, Taigu
[2] National Engineering Research Center for Information Technology in Agriculture, Beijing
[3] Beijing Research Center for Information Technology in Agriculture, Beijing
[4] Beijing Research Center of Intelligent Equipment for Agriculture, Beijing
Keywords
Agriculture; BERT; BiLSTM; Dictionary feature; Named entity recognition; Text
DOI
10.11975/j.issn.1002-6819.2022.03.013
Abstract
Agricultural named entity recognition is a fundamental task for information extraction in the agricultural domain. Existing approaches to entity recognition rely on local context features, cannot resolve polysemy, and have a low recognition rate for rare entities. To address these problems, a model combining character-level features and dictionary features was proposed to automatically identify entities in text, with the character-level features obtained from the BERT (Bidirectional Encoder Representations from Transformers) model. First, the BERT pre-trained language model was used to integrate left and right contextual information to obtain character-level features and enhance the semantic representation of words, alleviating the problem of polysemy. Second, an agricultural dictionary was built, and external dictionary information was introduced through a feature extraction strategy to improve the recognition accuracy of the model for rare or unknown entities. Two feature extraction strategies were designed to capture the dictionary features: the N-gram feature template algorithm and the bi-directional maximum matching algorithm. Then, the character-level features and dictionary features were fused as the input of the next neural network layer. Finally, the fused features were encoded by a BiLSTM (Bi-directional Long Short-Term Memory) neural network layer to obtain the sequence feature matrix, and the optimal label sequence was obtained by a CRF (Conditional Random Field). Based on the knowledge of domain experts, a labeling strategy for agricultural named entities was proposed to solve the problem of fuzzy entity boundaries and ensure the integrity of the entities. The experiments were carried out on an agricultural corpus containing 5 295 labeled corpus entries and 5 categories of agricultural entities.
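The bi-directional maximum matching step used to derive dictionary features can be illustrated with a minimal sketch. The abstract does not give the authors' implementation, so everything here is an assumption made for illustration: the function names, the tie-breaking heuristic (fewer segments, then fewer single-character segments, ties going to the backward result), the B/I/E/S/O tag scheme, and the toy lexicon.

```python
from typing import List, Set, Tuple


def forward_max_match(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Greedy left-to-right segmentation, trying the longest lexicon entry first."""
    segments, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or piece in lexicon:
                segments.append(piece)
                i += size
                break
    return segments


def backward_max_match(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Greedy right-to-left segmentation, trying the longest lexicon entry first."""
    segments, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                segments.append(piece)
                j -= size
                break
    return segments[::-1]


def bidirectional_max_match(text: str, lexicon: Set[str]) -> List[str]:
    """Run both directions and keep the result preferred by a common heuristic."""
    max_len = max((len(word) for word in lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)

    def score(segments: List[str]) -> Tuple[int, int]:
        # Assumed heuristic: fewer segments, then fewer single-character segments.
        return (len(segments), sum(len(s) == 1 for s in segments))

    return fwd if score(fwd) < score(bwd) else bwd


def dictionary_tags(text: str, lexicon: Set[str]) -> List[str]:
    """Produce one dictionary-feature tag per character (hypothetical B/I/E/S/O scheme)."""
    tags: List[str] = []
    for seg in bidirectional_max_match(text, lexicon):
        if seg not in lexicon:
            tags.append("O")  # unmatched segments are always single characters
        elif len(seg) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(seg) - 2) + ["E"])
    return tags


# Toy lexicon and sentence, invented for this example only.
toy_lexicon = {"小麦", "条锈病", "小麦条锈病"}
toy_tags = dictionary_tags("小麦条锈病防治", toy_lexicon)
```

Per-character tags of this kind could then be one-hot encoded or mapped through an embedding table before being fused with the BERT character vectors; the abstract reports that the embedding variant performed better.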
The results showed that the model achieved good overall performance on the corpus, with recognition precision, recall, and F1-score of 94.84%, 95.23%, and 95.03%, respectively. In terms of specific categories, owing to the obvious boundary characteristics of crop diseases and pesticides, the model achieved higher recognition precision on these two categories than on the remaining three (agricultural machinery, pests, and crop varieties). Regarding the effectiveness of the dictionary feature extraction strategies, experimental comparison showed that the model based on the bi-directional maximum matching algorithm performed better than the one based on the N-gram feature template algorithm. The N-gram feature template model performed best when the number of templates was 10, with a recognition precision of 93.95% and an F1-score of 94.03%. For the bi-directional maximum matching algorithm, feature embedding captured more latent information than one-hot encoding, improving the precision and F1-score of the model by 0.49 and 0.91 percentage points, respectively. Compared with the BiLSTM-CRF and BERT-BiLSTM-CRF models, the proposed BERT-Dic-BiLSTM-CRF model had a clear advantage, with the highest recognition precision of 94.84%. Compared with the BERT-BiLSTM-CRF model, the recognition precision of the BERT-Dic-BiLSTM-CRF model for rare and unknown entities improved by 5.93 and 6.44 percentage points, respectively, further verifying that integrating dictionary features into the model can improve its recognition accuracy for such entities. © 2022, Editorial Department of the Transactions of the Chinese Society of Agricultural Engineering. All rights reserved.
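As a quick consistency check on the reported overall scores, the F1-score can be recomputed from precision and recall as their harmonic mean. This is a small sketch using the standard definition; the function name is chosen here for illustration and does not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall as reported in the abstract, in percent.
overall_f1 = f1_score(94.84, 95.23)
```

Rounded to two decimals, the result agrees with the reported F1-score of 95.03%.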
Pages: 112-118
Number of pages: 6
Related papers
30 in total
  • [1] Zhang Shanwen, Wang Zhen, Wang Zuliang, Prediction of wheat stripe rust disease by combining knowledge graph and bidirectional long short term memory network, Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 36, 12, pp. 172-178, (2020)
  • [2] Zhang J, Shen D, Zhou G D, et al., Enhancing HMM-based biomedical named entity recognition by studying special phenomena, Journal of Biomedical Informatics, 37, 6, pp. 411-422, (2004)
  • [3] Saha S K, Sarkar S, Mitra P., Feature selection techniques for maximum entropy based biomedical named entity recognition, Journal of Biomedical Informatics, 42, 5, pp. 905-911, (2009)
  • [4] Sun C J, Guan Y, Wang X L, et al., Rich features based conditional random fields for biological named entities recognition, Computers in Biology and Medicine, 37, 9, pp. 1327-1333, (2007)
  • [5] Li Xiang, Wei Xiaohong, Jia Lu, et al., Recognition of crops, diseases and pesticides named entities in Chinese based on conditional random fields, Transactions of the Chinese Society for Agricultural Machinery, 48, pp. 178-185, (2017)
  • [6] Huang Nian'e, Huang He, Wang Rujing, Agriculture-related product name extraction and category labeling based on ontology and conditional random field, Journal of Computer Applications, 37, 1, pp. 233-238, (2017)
  • [7] Wang Chunyu, Wang Fang, Study on recognition of Chinese agricultural named entity with conditional random fields, Journal of Agricultural University of Hebei, 37, 1, pp. 132-135, (2014)
  • [8] Xu K, Yang Z G, Kang P P, et al., Document-level attention-based BiLSTM-CRF incorporating disease dictionary for disease named entity recognition, Computers in Biology and Medicine, 108, 22, pp. 122-132, (2019)
  • [9] Maryam H, Leon W, Mariana N, et al., Deep learning with word embeddings improves biomedical named entity recognition, Bioinformatics, 33, 14, pp. 37-48, (2017)
  • [10] Wang Q, Zhou Y M, Ruan T, et al., Incorporating dictionaries into deep neural networks for the Chinese clinical named entity recognition, Journal of Biomedical Informatics, 92, (2019)