BIGBIO: A Framework for Data-Centric Biomedical Natural Language Processing

Citations: 0

Authors:

Fries, Jason Alan ^{[1]}

Weber, Leon ^{[2,3]}

Seelam, Natasha ^{[4]}

Altay, Gabriel ^{[5]}

Datta, Debajyoti ^{[6]}

Su, Ruisi ^{[7]}

Garda, Samuele ^{[2]}

Kang, Sunny M. S. ^{[8]}

Biderman, Stella ^{[9,10]}

Samwald, Matthias ^{[11]}

Bach, Stephen H. ^{[12]}

Kusa, Wojciech ^{[13]}

Cahyawijaya, Samuel ^{[14]}

Barth, Fabio ^{[2]}

Ott, Simon ^{[11]}

Saenger, Mario ^{[2]}

Wang, Bo

Callahan, Alison ^{[1]}

Perinan, Daniel Leon

Gigant, Theo ^{[7]}

Haller, Patrick ^{[2]}

Chim, Jenny

Posada, Jose

Giorgi, John

Sivaraman, Karthik Rangasai

Pamies, Marc

Nezhurina, Marianna

Martin, Robert ^{[2]}

Freidank, Moritz

Dahlberg, Nathan ^{[7]}

Mishra, Shubhanshu

Bose, Shamik ^{[7]}

Broad, Nicholas

Labrak, Yanis

Deshmukh, Shlok S.

Kiblawi, Sid

Singh, Ayush ^{[7]}

Vu, Minh Chien

Neeraj, Trishala

Golde, Jonas ^{[2]}

del Moral, Albert Villanova

Beilharz, Benjamin

Affiliations:

[1] Stanford Univ, Stanford, CA 94305 USA

[2] Humboldt Univ, Berlin, Germany

[3] Max Delbruck Ctr Mol Med, Berlin, Germany

[4] Sherlock Biosci, Watertown, MA 02472 USA

[5] Tempus Labs Inc, Chicago, IL 60654 USA

[6] Univ Virginia, Charlottesville, VA 22903 USA

[7] BigScience, New York, NY USA

[8] Immuneering, New York, NY USA

[9] EleutherAI, Oinoi, Greece

[10] Booz Allen Hamilton, Mclean, VA USA

[11] Med Univ Vienna, Inst Artificial Intelligence, Vienna, Austria

[12] Brown Univ, Providence, RI 02912 USA

[13] TU Wien, Vienna, Austria

[14] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022) | 2022

Funding:

European Union Horizon 2020;

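Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes: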

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Training and evaluating language models increasingly requires the construction of meta-datasets - diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a variety of novel instruction tuning tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BIGBIO, a community library of 126+ biomedical NLP datasets, currently covering 13 task categories and 10+ languages. BIGBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BIGBIO is an ongoing community effort and is available at this URL.

Pages: 15
