BIGBIO: A Framework for Data-Centric Biomedical Natural Language Processing

Cited by: 0
Authors
Fries, Jason Alan [1]
Weber, Leon [2, 3]
Seelam, Natasha [4]
Altay, Gabriel [5]
Datta, Debajyoti [6]
Su, Ruisi [7]
Garda, Samuele [2]
Kang, Sunny M. S. [8]
Biderman, Stella [9, 10]
Samwald, Matthias [11]
Bach, Stephen H. [12]
Kusa, Wojciech [13]
Cahyawijaya, Samuel [14]
Barth, Fabio [2]
Ott, Simon [11]
Saenger, Mario [2]
Wang, Bo
Callahan, Alison [1]
Perinan, Daniel Leon
Gigant, Theo [7]
Haller, Patrick [2]
Chim, Jenny
Posada, Jose
Giorgi, John
Sivaraman, Karthik Rangasai
Pamies, Marc
Nezhurina, Marianna
Martin, Robert [2]
Freidank, Moritz
Dahlberg, Nathan [7]
Mishra, Shubhanshu
Bose, Shamik [7]
Broad, Nicholas
Labrak, Yanis
Deshmukh, Shlok S.
Kiblawi, Sid
Singh, Ayush [7]
Vu, Minh Chien
Neeraj, Trishala
Golde, Jonas [2]
del Moral, Albert Villanova
Beilharz, Benjamin
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Humboldt Univ, Berlin, Germany
[3] Max Delbruck Ctr Mol Med, Berlin, Germany
[4] Sherlock Biosci, Watertown, MA 02472 USA
[5] Tempus Labs Inc, Chicago, IL 60654 USA
[6] Univ Virginia, Charlottesville, VA 22903 USA
[7] BigScience, New York, NY USA
[8] Immuneering, New York, NY USA
[9] EleutherAI, Oinoi, Greece
[10] Booz Allen Hamilton, Mclean, VA USA
[11] Med Univ Vienna, Inst Artificial Intelligence, Vienna, Austria
[12] Brown Univ, Providence, RI 02912 USA
[13] TU Wien, Vienna, Austria
[14] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022) | 2022
Funding
European Union Horizon 2020
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Training and evaluating language models increasingly requires the construction of meta-datasets - diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a variety of novel instruction tuning tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BIGBIO, a community library of 126+ biomedical NLP datasets, currently covering 13 task categories and 10+ languages. BIGBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BIGBIO is an ongoing community effort and is available at this URL.
Pages: 15