Structured information extraction from scientific text with large language models

Times cited: 164
Authors
Dagdelen, John [1, 2]
Dunn, Alexander [1, 2]
Lee, Sanghoon [1, 2]
Walker, Nicholas [1]
Rosen, Andrew S. [1, 2]
Ceder, Gerbrand [1, 2]
Persson, Kristin A. [1, 2]
Jain, Anubhav [1]
Affiliations
[1] Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Mat Sci & Engn Dept, Berkeley, CA USA
Keywords
CANCER RESISTANCE; CELLULAR SENESCENCE; PHYLOGENETIC ANALYSIS; PREMATURE SENESCENCE; MOLE-RAT; MECHANISMS; TRANSCRIPTION; DISCOVERY; ALIGNMENT; PROVIDES;
DOI
10.1038/s41467-024-45563-x
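Chinese Library Classification number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract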
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers. Extracting scientific data from published research is a complex task that requires specialised tools. Here the authors present a scheme based on large language models to automate the retrieval of information from text in a flexible and accessible manner.
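As an illustration of the "list of JSON objects" output format mentioned in the abstract, the following is a minimal sketch of how such a completion might be parsed for the dopant/host task. The completion string, the keys "host" and "dopants", and the helper names parse_records and dopant_host_pairs are all hypothetical, chosen for this example; they are not taken from the paper.

```python
import json

# Hypothetical model completion. The schema ("host", "dopants") is an
# assumption made for illustration, not the schema used by the authors.
completion = (
    '[{"host": "ZnO", "dopants": ["Al", "Ga"]}, '
    '{"host": "TiO2", "dopants": ["Nb"]}]'
)


def parse_records(text):
    """Parse a completion into a list of dict records; return [] if malformed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only well-formed entries (JSON objects).
    return [record for record in data if isinstance(record, dict)]


def dopant_host_pairs(records):
    """Flatten records into (dopant, host) pairs."""
    pairs = []
    for record in records:
        host = record.get("host")
        for dopant in record.get("dopants", []):
            pairs.append((dopant, host))
    return pairs


records = parse_records(completion)
pairs = dopant_host_pairs(records)
print(pairs)
```

Returning an empty list for malformed output is one possible design choice; a real pipeline might instead log or retry such completions.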
Pages: 14