Structured information extraction from scientific text with large language models

Cited by: 96
Authors
Dagdelen, John [1, 2]
Dunn, Alexander [1, 2]
Lee, Sanghoon [1, 2]
Walker, Nicholas [1]
Rosen, Andrew S. [1, 2]
Ceder, Gerbrand [1, 2]
Persson, Kristin A. [1, 2]
Jain, Anubhav [1]
Affiliations
[1] Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Mat Sci & Engn Dept, Berkeley, CA USA
Keywords
CANCER RESISTANCE; CELLULAR SENESCENCE; PHYLOGENETIC ANALYSIS; PREMATURE SENESCENCE; MOLE-RAT; MECHANISMS; TRANSCRIPTION; DISCOVERY; ALIGNMENT; PROVIDES;
DOI
10.1038/s41467-024-45563-x
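Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code
07; 0710; 09;
Abstract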
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers. Extracting scientific data from published research is a complex task requiring specialised tools. Here the authors present a scheme based on large language models to automate the retrieval of information from text in a flexible and accessible manner.
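The abstract notes that extracted records can be returned as a list of JSON objects. The following is a minimal Python sketch of how such output might be consumed downstream for the dopant/host task. The completion string, the field names `host` and `dopants`, and both helper functions are hypothetical and chosen for illustration only; they are not the schema or code used by the authors.

```python
import json

# Hypothetical model completion. The "host"/"dopants" schema is an
# assumption made for this sketch, not the paper's actual output format.
completion = (
    '[{"host": "ZnO", "dopants": ["Al", "Ga"]},'
    ' {"host": "TiO2", "dopants": ["Nb"]}]'
)


def parse_records(text):
    """Parse a completion into a list of dicts; return [] if malformed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only JSON objects; drop stray strings or numbers.
    return [record for record in data if isinstance(record, dict)]


def dopant_host_pairs(records):
    """Flatten records into (dopant, host) tuples."""
    pairs = []
    for record in records:
        host = record.get("host")
        for dopant in record.get("dopants", []):
            pairs.append((dopant, host))
    return pairs


records = parse_records(completion)
pairs = dopant_host_pairs(records)
print(pairs)
```

Returning an empty list on malformed output is one possible design choice: generated text is not guaranteed to be valid JSON, so a caller can count and skip failed completions instead of crashing.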
Pages: 14