BioAug: Conditional Generation based Data Augmentation for Low-Resource Biomedical NER

Cited by: 2
Authors
Ghosh, Sreyan [1]
Tyagi, Utkarsh [1]
Kumar, Sonal [1]
Manocha, Dinesh [1]
Affiliations
[1] Univ Maryland, College Pk, MD 20742 USA
Source
PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023 | 2023
Keywords
Named Entity Recognition; Information Extraction; Biomedical
DOI
10.1145/3539618.3591957
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Biomedical Named Entity Recognition (BioNER) is the fundamental task of identifying named entities in biomedical text. However, BioNER suffers from severe data scarcity and lacks high-quality labeled data because annotation requires highly specialized expert knowledge. Though data augmentation has been shown to be highly effective for low-resource NER in general, existing data augmentation techniques fail to produce factual and diverse augmentations for BioNER. In this paper, we present BioAug, a novel data augmentation framework for low-resource BioNER. BioAug, built on BART, is trained to solve a novel text reconstruction task based on selective masking and knowledge augmentation. After training, we perform conditional generation, producing diverse augmentations by conditioning BioAug on selectively corrupted text, as in the training stage. We demonstrate the effectiveness of BioAug on 5 benchmark BioNER datasets and show that BioAug outperforms all our baselines by a significant margin (1.5%-21.5% absolute improvement) and is able to generate augmentations that are both more factual and more diverse. Code: https://github.com/Sreyan88/BioAug.
Pages: 1853-1858
Number of pages: 6