Unsupervised word embeddings capture latent knowledge from materials science literature

Cited by: 689
Authors
Tshitoyan, Vahe [1, 3]
Dagdelen, John [1, 2]
Weston, Leigh [1]
Dunn, Alexander [1, 2]
Rong, Ziqin [1]
Kononova, Olga [2]
Persson, Kristin A. [1, 2]
Ceder, Gerbrand [1, 2]
Jain, Anubhav [1]
Affiliations
[1] Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Dept Mat Sci & Engn, Berkeley, CA 94720 USA
[3] Google LLC, Mountain View, CA 94043 USA
Keywords
TOTAL-ENERGY CALCULATIONS; THERMAL-CONDUCTIVITY; EFFICIENCY
DOI
10.1038/s41586-019-1335-8
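Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract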
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases (1,2), which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing (3-10), which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (11-13) (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure-property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
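As a rough illustration of how vector representations of words could be used to rank candidate materials for an application, the sketch below scores a few material names against a query word by cosine similarity. The abstract does not state the ranking measure, so cosine similarity is an assumption here, and the three-dimensional vectors, the dictionary `toy_embeddings`, and the helper `rank_candidates` are all hypothetical, invented for this example rather than taken from the paper.

```python
import math

# Hypothetical toy vectors, invented for illustration only. Real word
# embeddings would be learned from a text corpus and have far more dimensions.
toy_embeddings = {
    "thermoelectric": [0.9, 0.1, 0.2],
    "Bi2Te3": [0.8, 0.2, 0.1],
    "PbTe": [0.7, 0.3, 0.3],
    "NaCl": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates, embeddings):
    """Sort candidate words by similarity to the query word, highest first."""
    query_vec = embeddings[query]
    scored = [
        (name, cosine_similarity(query_vec, embeddings[name]))
        for name in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_candidates(
    "thermoelectric", ["NaCl", "PbTe", "Bi2Te3"], toy_embeddings
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these made-up vectors the two tellurides rank above NaCl, but that ordering reflects only how the toy numbers were chosen and says nothing about the paper's actual results.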
Pages: 95 / +
Number of pages: 12
Related papers
40 in total
  • [1] Machine learning for molecular and materials science
    Butler, Keith T.
    Davies, Daniel W.
    Cartwright, Hugh
    Isayev, Olexandr
    Walsh, Aron
    [J]. NATURE, 2018, 559 (7715) : 547 - 555
  • [2] Bringing Transparency Design into Practice
    Eiband, Malin
    Schneider, Hanna
    Bilandzic, Mark
    Fazekas-Con, Julian
    Haug, Mareike
    Hussmann, Heinrich
    [J]. IUI 2018: PROCEEDINGS OF THE 23RD INTERNATIONAL CONFERENCE ON INTELLIGENT USER INTERFACES, 2018, : 211 - 223
  • [3] Chemical named entities recognition: a review on approaches and applications
    Eltyeb, Safaa
    Salim, Naomie
    [J]. JOURNAL OF CHEMINFORMATICS, 2014, 6
  • [4] Machine Learning Energies of 2 Million Elpasolite (ABC2D6) Crystals
    Faber, Felix A.
    Lindmaa, Alexander
    von Lilienfeld, O. Anatole
    Armiento, Rickard
    [J]. PHYSICAL REVIEW LETTERS, 2016, 117 (13)
  • [5] Friedman C, 2001, Bioinformatics, V17 Suppl 1, pS74
  • [6] Data-Driven Review of Thermoelectric Materials: Performance and Resource Considerations
    Gaultois, Michael W.
    Sparks, Taylor D.
    Borg, Christopher K. H.
    Seshadri, Ram
    Bonificio, William D.
    Clarke, David R.
    [J]. CHEMISTRY OF MATERIALS, 2013, 25 (15) : 2911 - 2920
  • [7] Advances in thermoelectric materials research: Looking back and moving forward
    He, Jian
    Tritt, Terry M.
    [J]. SCIENCE, 2017, 357 (6358)
  • [8] Materials science with large-scale data and informatics: Unlocking new opportunities
    Hill, Joanne
    Mulholland, Gregory
    Persson, Kristin
    Seshadri, Ram
    Wolverton, Chris
    Meredig, Bryce
    [J]. MRS BULLETIN, 2016, 41 (05) : 399 - 409
  • [9] INHOMOGENEOUS ELECTRON-GAS
    RAJAGOPAL, AK
    CALLAWAY, J
    [J]. PHYSICAL REVIEW B, 1973, 7 (05) : 1912 - 1919
  • [10] Jain A., 2013, CONCURR COMPUT, V27, P5037