Named Entity Recognition in Government Audit Texts Based on ChineseBERT and Character-Word Fusion

Cited by: 2
Authors
Huang, Baohua [1]
Lin, Yunjie [1]
Pang, Si [1]
Fu, Long [1]
Affiliations
[1] Guangxi Univ, Sch Comp Elect & Informat, Nanning 530004, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2024 / Vol. 14 / Issue 04
Funding
National Natural Science Foundation of China;
Keywords
smart audit; named entity recognition; character-word fusion; GHM loss function; ChineseBERT;
DOI
10.3390/app14041425
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Named entity recognition (NER) in government audit text is a key task in intelligent auditing. Three problems motivate this study: corpora in the government auditing domain are scarce, traditional character vectors make insufficient use of word-level information, and the features of audit entities are insufficiently captured. To address them, the study builds its own audit-domain dataset and proposes CW-CBGC, a model for recognizing named entities in government audit text based on ChineseBERT and character-word fusion. First, the ChineseBERT pre-trained model extracts character vectors that integrate glyph and pinyin features, and these are combined with word vectors constructed dynamically by the BERT pre-trained model. The resulting sequences of character-word fusion vectors are then fed into a bidirectional gated recurrent neural network (BiGRU) to learn textual features. Finally, a Conditional Random Field (CRF) generates the globally optimal label sequence, and the GHM classification loss function is used during training to address evaluation errors caused by noisy entities and an imbalanced number of entities. On the audit dataset, the model achieves an F1 value of 97.23%, which is 3.64% higher than the baseline model's F1 value; on the public Resume dataset, it achieves an F1 value of 96.26%, which is 0.73-2.78% higher than mainstream models. The experimental results show that the proposed model can effectively recognize entities in government audit texts and has a degree of generalization ability.
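The abstract names the GHM classification loss but does not describe how it is computed or how it is combined with the CRF output. As a rough illustration only, the sketch below shows the general idea of GHM-C in its binary form: each example is weighted by the inverse of the "gradient density" around its gradient norm g = |p - y|, so that examples falling in crowded gradient-norm regions contribute less. The function names `ghm_weights` and `ghm_c_loss`, the bin count, and the toy inputs are assumptions for illustration, not the authors' implementation.

```python
import math

def ghm_weights(probs, labels, bins=10):
    """Per-example GHM-C weights for binary predictions (illustrative sketch).

    probs  : predicted probabilities in [0, 1]
    labels : ground-truth labels, 0 or 1
    """
    n = len(probs)
    # Gradient norm of sigmoid cross-entropy w.r.t. the logit: g = |p - y|
    g = [abs(p - y) for p, y in zip(probs, labels)]
    # Unit-region approximation: bucket g into equal-width bins over [0, 1]
    idx = [min(int(gi * bins), bins - 1) for gi in g]
    counts = [0] * bins
    for i in idx:
        counts[i] += 1
    # Gradient density = count / bin width = count * bins; weight = n / density
    return [n / (counts[i] * bins) for i in idx]

def ghm_c_loss(probs, labels, bins=10, eps=1e-12):
    """Cross-entropy averaged over examples, reweighted by the GHM-C weights."""
    weights = ghm_weights(probs, labels, bins)
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        ce = -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        total += w * ce
    return total / len(probs)

# Toy batch (hypothetical values): four easy negatives and one harder positive.
probs = [0.02, 0.03, 0.04, 0.05, 0.35]
labels = [0, 0, 0, 0, 1]
weights = ghm_weights(probs, labels)
loss = ghm_c_loss(probs, labels)
```

In this toy batch the four easy negatives share one gradient-norm bin and are each down-weighted relative to the single positive, which sits alone in its bin. Applying such weights to token-level tag predictions in an NER model would be one possible use; the abstract does not say which variant the paper adopts.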
Pages: 15