BIGBIO: A Framework for Data-Centric Biomedical Natural Language Processing

Cited by: 0

Authors:

Fries, Jason Alan ^{[1]}

Weber, Leon ^{[2,3]}

Seelam, Natasha ^{[4]}

Altay, Gabriel ^{[5]}

Datta, Debajyoti ^{[6]}

Su, Ruisi ^{[7]}

Garda, Samuele ^{[2]}

Kang, Sunny M. S. ^{[8]}

Biderman, Stella ^{[9,10]}

Samwald, Matthias ^{[11]}

Bach, Stephen H. ^{[12]}

Kusa, Wojciech ^{[13]}

Cahyawijaya, Samuel ^{[14]}

Barth, Fabio ^{[2]}

Ott, Simon ^{[11]}

Saenger, Mario ^{[2]}

Wang, Bo

Callahan, Alison ^{[1]}

Perinan, Daniel Leon

Gigant, Theo ^{[7]}

Haller, Patrick ^{[2]}

Chim, Jenny

Posada, Jose

Giorgi, John

Sivaraman, Karthik Rangasai

Pamies, Marc

Nezhurina, Marianna

Martin, Robert ^{[2]}

Freidank, Moritz

Dahlberg, Nathan ^{[7]}

Mishra, Shubhanshu

Bose, Shamik ^{[7]}

Broad, Nicholas

Labrak, Yanis

Deshmukh, Shlok S.

Kiblawi, Sid

Singh, Ayush ^{[7]}

Vu, Minh Chien

Neeraj, Trishala

Golde, Jonas ^{[2]}

del Moral, Albert Villanova

Beilharz, Benjamin

Affiliations:

[1] Stanford Univ, Stanford, CA 94305 USA

[2] Humboldt Univ, Berlin, Germany

[3] Max Delbruck Ctr Mol Med, Berlin, Germany

[4] Sherlock Biosci, Watertown, MA 02472 USA

[5] Tempus Labs Inc, Chicago, IL 60654 USA

[6] Univ Virginia, Charlottesville, VA 22903 USA

[7] BigScience, New York, NY USA

[8] Immuneering, New York, NY USA

[9] EleutherAI, Oinoi, Greece

[10] Booz Allen Hamilton, Mclean, VA USA

[11] Med Univ Vienna, Inst Artificial Intelligence, Vienna, Austria

[12] Brown Univ, Providence, RI 02912 USA

[13] TU Wien, Vienna, Austria

[14] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China

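Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022) | 2022

Funding:

European Union Horizon 2020;

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract: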

Training and evaluating language models increasingly requires the construction of meta-datasets: diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a variety of novel instruction tuning tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BIGBIO, a community library of 126+ biomedical NLP datasets, currently covering 13 task categories and 10+ languages. BIGBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BIGBIO is an ongoing community effort and is available at this URL.


Pages: 15
