Chinese Clinical Named Entity Recognition From Electronic Medical Records Based on Multisemantic Features by Using Robustly Optimized Bidirectional Encoder Representation From Transformers Pretraining Approach Whole Word Masking and Convolutional Neural Networks: Model Development and Validation

Cited by: 7
Authors
Wang, Weijie [1]
Li, Xiaoying [1]
Ren, Huiling [1]
Gao, Dongping [1]
Fang, An [1]
Affiliations
[1] Chinese Acad Med Sci & Peking Union Med Coll, Inst Med Informat & Lib, 69 Dongdan N St, Beijing 100005, Peoples R China
Keywords
Chinese clinical named entity recognition; multisemantic features; image feature; Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach Whole Word Masking; RoBERTa-wwm; convolutional neural network; CNN; classification; dictionary; text
DOI
10.2196/44597
Chinese Library Classification number
R-058
Subject classification number
Abstract
Background: Clinical electronic medical records (EMRs) contain important information on patients' anatomy, symptoms, examinations, diagnoses, and medications. Large-scale mining of this information provides valuable reference material for medical research. Because Chinese grammar is complex and Chinese word boundaries are blurred, Chinese clinical named entity recognition (CNER) remains a notable challenge. Downstream tasks such as medical entity structuring, medical entity standardization, medical entity relationship extraction, and medical knowledge graph construction depend largely on the quality of medical named entity recognition. Reliable CNER results would support building domain knowledge graphs, knowledge bases, and knowledge retrieval systems. They would also provide research ideas for scientists and decision-making references for doctors, and could even guide patients in disease and health management. Obtaining excellent CNER results is therefore essential.
Objective: We aimed to propose a Chinese CNER method that uses multisemantic features to learn semantics-enriched representations, improving machine understanding of the deep semantic information in EMRs and making medical information more readable and understandable.
Methods: First, we combined Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach Whole Word Masking (RoBERTa-wwm) with dynamic fusion and Chinese character features, including 5-stroke code, Zheng code, phonological code, and stroke code, extracted by 1-dimensional convolutional neural networks (CNNs), to obtain fine-grained semantic features of Chinese characters. Subsequently, we converted Chinese characters into square images and used a 2-dimensional CNN to obtain character image features from another modality. Finally, we input the multisemantic features into a Bidirectional Long Short-Term Memory network with Conditional Random Fields to perform Chinese CNER. We compared our model with baseline and existing research models, and we ablated and analyzed the features involved to verify the model's effectiveness.
Results: We collected 1379 Yidu-S4K EMRs containing 23,655 entities in 6 categories and 2007 self-annotated EMRs containing 118,643 entities in 7 categories. Our model outperformed the comparison models, with F1-scores of 89.28% and 84.61% on the Yidu-S4K and self-annotated data sets, respectively. The ablation analysis demonstrated that each feature and method we used improved entity recognition.
Conclusions: Our proposed CNER method mines richer deep semantic information from EMRs through multisemantic embedding with RoBERTa-wwm and CNNs. It enhances the semantic recognition of characters at different granularity levels and improves generalization through information complementarity among different semantic features, enabling the machine to understand EMRs semantically and improving CNER accuracy.
Pages: 21
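
The abstract reports F1-scores for entity recognition but does not say how they were computed. As a minimal sketch, assuming BIO-style tags and exact-match scoring at the entity level (both are assumptions, not details from the abstract), an F1-score of this kind could be calculated as follows. The function names and the DIS and DRUG labels are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    Stray I- tags that do not continue an open entity of the same
    type are ignored (an assumption of this sketch).
    """
    spans: Set[Span] = set()
    start, label = None, None
    # A sentinel "O" at the end closes any entity still open.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-type matching."""
    gold_spans = bio_to_spans(gold)
    pred_spans = bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical labels: DIS (disease) and DRUG (medication).
    gold = ["B-DIS", "I-DIS", "O", "B-DRUG", "I-DRUG"]
    pred = ["B-DIS", "I-DIS", "O", "B-DRUG", "O"]
    print(entity_f1(gold, pred))
```

In this toy case the predicted drug span is one character too short, so it counts as both a false positive and a false negative. Exact-match scoring is strict in this way, which is one reason blurred word boundaries make Chinese CNER difficult.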