Challenges and Advances in Information Extraction from Scientific Literature: a Review

Cited by: 27
Authors
Hong, Zhi [1]
Ward, Logan [2]
Chard, Kyle [1,2]
Blaiszik, Ben [1,2]
Foster, Ian [1,2]
Affiliations
[1] Univ Chicago, Chicago, IL 60637 USA
[2] Argonne Natl Lab, Lemont, IL USA
Keywords
Information extraction; Text mining; Scientific data; PROPERTY DATA; RECOGNITION; GENERATION; RECAPTCHA; STANDARD; SYSTEM; WEB;
DOI
10.1007/s11837-021-04902-9
Chinese Library Classification
T [Industrial Technology]
Discipline classification code
08
Abstract
Scientific articles have long been the primary means of disseminating scientific discoveries. Over the centuries, valuable data and potentially groundbreaking insights have been collected and buried deep in the mountain of publications. In materials engineering, such data are spread across technical handbooks, specification sheets, journal articles, and laboratory notebooks in myriad formats. Extracting information from papers on a large scale has been a tedious and time-consuming job to which few researchers have wanted to devote their limited time and effort, yet it is essential for modern data-driven design practices. In recent years, however, the computer science community has made significant progress on techniques for automated information extraction from free text. Yet transformative application of these techniques to scientific literature remains elusive, due not to a lack of interest or effort but to technical and logistical challenges. Using the challenges in the materials science literature as a driving motivation, we review the gaps between state-of-the-art information extraction methods and the practical application of such methods to scientific texts, and offer a comprehensive overview of work that can be undertaken to close these gaps.
Pages: 3383-3400
Page count: 18