Few-shot biomedical named entity recognition via knowledge-guided instance generation and prompt contrastive learning

Cited by: 15
Authors
Chen, Peng [1 ]
Wang, Jian [1 ,3 ]
Lin, Hongfei [1 ]
Zhao, Di [2 ]
Yang, Zhihao [1 ]
Wren, Jonathan [1 ]
Affiliations
[1] Dalian Univ Technol, Sch Comp Sci, Dalian 116024, Peoples R China
[2] Dalian Minzu Univ, Sch Comp Sci & Engn, Dalian 116600, Peoples R China
[3] Dalian Univ Technol, Sch Comp Sci & Technol, 2 Linggong Rd, Dalian 116024, Peoples R China
Keywords
UMLS;
DOI
10.1093/bioinformatics/btad496
Chinese Library Classification
Q5 [Biochemistry];
Subject Classification Codes
071010 ; 081704 ;
Abstract
Motivation: Few-shot learning, which can perform named entity recognition effectively in low-resource scenarios, has attracted growing attention, but it has not yet been widely studied in the biomedical field. In contrast to high-resource domains, biomedical named entity recognition (BioNER) often faces limited human-labeled data in real-world scenarios, leading to poor generalization when the model is trained on only a few labeled instances. Recent approaches either leverage cross-domain high-resource data or fine-tune a pre-trained masked language model on the limited labeled samples to generate new synthetic data; the former is prone to domain-shift problems, while the latter tends to yield low-quality synthetic data. Therefore, in this article, we study a more realistic scenario, i.e. few-shot learning for BioNER. Results: Leveraging a domain knowledge graph, we propose knowledge-guided instance generation for few-shot BioNER, which generates diverse and novel entities based on the similar semantic relations of neighboring nodes. In addition, by introducing question prompts, we cast BioNER as a question-answering task and propose prompt contrastive learning to improve the robustness of the model by measuring the mutual information between query-answer pairs. Extensive experiments conducted under various few-shot settings show that the proposed framework achieves superior performance. In particular, in a low-resource scenario with only 20 samples, our approach substantially outperforms recent state-of-the-art models on four benchmark datasets, achieving an average improvement of up to 7.1% F1. Availability and implementation: Our source code and data are available at https://github.com/cpmss521/KGPC.
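The knowledge-guided instance generation described in the abstract — swapping an annotated entity for a knowledge-graph neighbor that shares a similar semantic relation — can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the toy graph, the `generate_instances` function, and the `(start, end, label)` span format are all assumptions made for the example (a real system would draw neighbors from a resource such as UMLS).

```python
# Toy domain knowledge graph: each entity surface form maps to neighbor
# entities assumed to share a similar semantic relation. The contents are
# illustrative placeholders, not data from the paper.
KG_NEIGHBORS = {
    "aspirin": ["ibuprofen", "naproxen"],
    "BRCA1": ["BRCA2", "TP53"],
}

def generate_instances(tokens, spans):
    """Create synthetic NER instances by replacing each annotated entity
    with a knowledge-graph neighbor.

    tokens: list of word strings.
    spans: list of (start, end, label) tuples, end exclusive.
    Returns a list of (new_tokens, new_spans) synthetic instances.
    """
    out = []
    for start, end, label in spans:
        surface = " ".join(tokens[start:end])
        for neighbor in KG_NEIGHBORS.get(surface, []):
            new_entity = neighbor.split()
            new_tokens = tokens[:start] + new_entity + tokens[end:]
            shift = len(new_entity) - (end - start)
            # Shift spans that follow the replaced entity; keep earlier ones.
            new_spans = [
                (s + shift, e + shift, l) if s >= end else (s, e, l)
                for s, e, l in spans
                if (s, e, l) != (start, end, label)
            ] + [(start, start + len(new_entity), label)]
            out.append((new_tokens, sorted(new_spans)))
    return out
```

For example, a sentence annotated with the entity "aspirin" would yield two new labeled instances, one per neighbor, preserving the original label and adjusting span offsets when the replacement has a different length.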
Pages: 10