Data-driven materials research enabled by natural language processing and information extraction

Cited by: 183
Authors
Olivetti, Elsa A. [1]
Cole, Jacqueline M. [2, 3, 4]
Kim, Edward [5]
Kononova, Olga [6, 7]
Ceder, Gerbrand [6, 7]
Han, Thomas Yong-Jin [8]
Hiszpanski, Anna M. [8]
Affiliations
[1] MIT, Dept Mat Sci & Engn, Cambridge, MA 02139 USA
[2] Univ Cambridge, Dept Phys, Cavendish Lab, JJ Thomson Ave, Cambridge CB3 0HE, England
[3] Rutherford Appleton Lab, ISIS Neutron & Muon Source, Harwell Sci & Innovat Campus, Didcot OX11 0QX, Oxon, England
[4] Univ Cambridge, Dept Chem Engn & Biotechnol, West Cambridge Site,Philippa Fawcett Dr, Cambridge CB3 0AS, England
[5] Xero, Sci Evaluat & Measurement, Toronto, ON M5H 4G1, Canada
[6] Univ Calif Berkeley, Dept Mat Sci & Engn, Berkeley, CA 94720 USA
[7] Lawrence Berkeley Natl Lab, Mat Sci Div, Berkeley, CA 94720 USA
[8] Lawrence Livermore Natl Lab, Div Mat Sci, Livermore, CA 94550 USA
Funding
US National Science Foundation; UK Science and Technology Facilities Council
Keywords
RECOGNITION; DESIGN; INFRASTRUCTURE; DISCOVERY; KNOWLEDGE; PLATFORM; SYSTEM; GENOME;
DOI
10.1063/5.0021106
Chinese Library Classification (CLC) number
O59 [Applied physics]
Subject classification code
Abstract
Given the emergence of data science and machine learning throughout all aspects of society, and particularly in the scientific domain, increased importance is placed on obtaining data. Data in materials science are particularly heterogeneous because of the wide range of materials classes that are explored and the variety of materials properties that are of interest. As a result, the data span many orders of magnitude and may appear as numerical text or as image-based information that requires quantitative interpretation. The ability to automatically consume and codify the scientific literature across domains, enabled by techniques adapted from the field of natural language processing, therefore has immense potential to unlock and generate the rich datasets necessary for data science and machine learning. This review focuses on the progress and practices of natural language processing and text mining of materials science literature, and it highlights opportunities for extracting additional information, beyond text, that is contained in figures and tables in articles. We discuss, with examples, several reasons for pursuing natural language processing for materials, including data compilation, hypothesis development, and understanding trends within and across fields. Current and emerging natural language processing methods, along with their applications to materials science, are detailed. We then discuss natural language processing and data challenges within the materials science domain where future directions may prove valuable.
Pages: 19