Challenges and Advances in Information Extraction from Scientific Literature: a Review

Cited by: 27
Authors
Hong, Zhi [1]
Ward, Logan [2]
Chard, Kyle [1, 2]
Blaiszik, Ben [1, 2]
Foster, Ian [1, 2]
Affiliations
[1] Univ Chicago, Chicago, IL 60637 USA
[2] Argonne Natl Lab, Lemont, IL USA
Keywords
Information extraction; Text mining; Scientific data; PROPERTY DATA; RECOGNITION; GENERATION; RECAPTCHA; STANDARD; SYSTEM; WEB;
DOI
10.1007/s11837-021-04902-9
Chinese Library Classification code
T [Industrial Technology]
Discipline classification code
08
Abstract
Scientific articles have long been the primary means of disseminating scientific discoveries. Over the centuries, valuable data and potentially groundbreaking insights have been collected and buried deep in the mountain of publications. In materials engineering, such data are spread across technical handbooks, specification sheets, journal articles, and laboratory notebooks in myriad formats. Extracting information from papers on a large scale has been a tedious and time-consuming job to which few researchers have wanted to devote their limited time and effort, yet it is essential for modern data-driven design practices. In recent years, however, the computer science community has made significant progress on techniques for automated information extraction from free text. Even so, transformative application of these techniques to scientific literature remains elusive, due not to a lack of interest or effort but to technical and logistical challenges. Using the challenges in the materials science literature as a driving motivation, we review the gaps between state-of-the-art information extraction methods and the practical application of such methods to scientific texts, and offer a comprehensive overview of work that can be undertaken to close these gaps.
Pages: 3383-3400
Page count: 18