BIGBIO: A Framework for Data-Centric Biomedical Natural Language Processing

Cited by: 0
Authors
Fries, Jason Alan [1]
Weber, Leon [2, 3]
Seelam, Natasha [4]
Altay, Gabriel [5]
Datta, Debajyoti [6]
Su, Ruisi [7]
Garda, Samuele [2]
Kang, Sunny M. S. [8]
Biderman, Stella [9, 10]
Samwald, Matthias [11]
Bach, Stephen H. [12]
Kusa, Wojciech [13]
Cahyawijaya, Samuel [14]
Barth, Fabio [2]
Ott, Simon [11]
Saenger, Mario [2]
Wang, Bo
Callahan, Alison [1]
Perinan, Daniel Leon
Gigant, Theo [7]
Haller, Patrick [2]
Chim, Jenny
Posada, Jose
Giorgi, John
Sivaraman, Karthik Rangasai
Pamies, Marc
Nezhurina, Marianna
Martin, Robert [2]
Freidank, Moritz
Dahlberg, Nathan [7]
Mishra, Shubhanshu
Bose, Shamik [7]
Broad, Nicholas
Labrak, Yanis
Deshmukh, Shlok S.
Kiblawi, Sid
Singh, Ayush [7]
Vu, Minh Chien
Neeraj, Trishala
Golde, Jonas [2]
del Moral, Albert Villanova
Beilharz, Benjamin
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Humboldt Univ, Berlin, Germany
[3] Max Delbruck Ctr Mol Med, Berlin, Germany
[4] Sherlock Biosci, Watertown, MA 02472 USA
[5] Tempus Labs Inc, Chicago, IL 60654 USA
[6] Univ Virginia, Charlottesville, VA 22903 USA
[7] BigScience, New York, NY USA
[8] Immuneering, New York, NY USA
[9] EleutherAI, Oinoi, Greece
[10] Booz Allen Hamilton, Mclean, VA USA
[11] Med Univ Vienna, Inst Artificial Intelligence, Vienna, Austria
[12] Brown Univ, Providence, RI 02912 USA
[13] TU Wien, Vienna, Austria
[14] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
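Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022) | 2022
Funding
EU Horizon 2020
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract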
Training and evaluating language models increasingly requires the construction of meta-datasets: diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing supervised datasets into a variety of novel instruction tuning tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BIGBIO, a community library of 126+ biomedical NLP datasets, currently covering 13 task categories and 10+ languages. BIGBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BIGBIO is an ongoing community effort and is available at this URL.
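To make the idea of task schema harmonization in the abstract more concrete, the following is a minimal sketch in Python. It is not BIGBIO's actual schema or API: the class names (`Entity`, `HarmonizedExample`), the converter functions, and both source record layouts are hypothetical, chosen only to illustrate how two differently formatted named-entity datasets might be mapped into one shared structure that downstream code can consume uniformly.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One annotated span in the shared (hypothetical) schema."""
    text: str
    start: int
    end: int
    label: str


@dataclass
class HarmonizedExample:
    """One document in the shared (hypothetical) schema."""
    doc_id: str
    text: str
    entities: list = field(default_factory=list)


def from_offsets(record):
    """Convert a hypothetical source that stores character-offset spans."""
    text = record["text"]
    entities = [Entity(text[s:e], s, e, label) for s, e, label in record["spans"]]
    return HarmonizedExample(record["id"], text, entities)


def from_bio_tags(record):
    """Convert a hypothetical source that stores tokens with BIO tags."""
    tokens, tags = record["tokens"], record["tags"]
    text = " ".join(tokens)  # tokens are joined with single spaces
    spans = []      # finished [start, end, label] spans
    current = None  # span currently being extended, if any
    offset = 0
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [start, end, tag[2:]]
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            current[1] = end  # extend the open span
        else:
            if current:
                spans.append(current)
            current = None
        offset = end + 1  # skip the joining space
    if current:
        spans.append(current)
    entities = [Entity(text[s:e], s, e, label) for s, e, label in spans]
    return HarmonizedExample(record["id"], text, entities)


def labels_in(examples):
    """Collect the set of entity labels across harmonized examples."""
    return {ent.label for ex in examples for ent in ex.entities}


if __name__ == "__main__":
    source_a = {"id": "a1", "text": "Aspirin reduces fever.",
                "spans": [(0, 7, "Chemical"), (16, 21, "Disease")]}
    source_b = {"id": "b1",
                "tokens": ["Lung", "cancer", "responds", "to", "cisplatin"],
                "tags": ["B-Disease", "I-Disease", "O", "O", "B-Chemical"]}
    harmonized = [from_offsets(source_a), from_bio_tags(source_b)]
    for example in harmonized:
        print(example.doc_id, [(e.text, e.label) for e in example.entities])
```

Once every source is converted to the same structure, a single routine such as `labels_in` can operate over the whole collection regardless of how each dataset was originally distributed, which is the practical benefit a harmonized schema is meant to provide.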
Pages: 15