Data quantity governance for machine learning in materials science

Cited by: 88
Authors
Liu, Yue [1 ,4 ]
Yang, Zhengwei [1 ]
Zou, Xinxin [1 ]
Ma, Shuchang [1 ]
Liu, Dahui [1 ]
Avdeev, Maxim [5 ,6 ]
Shi, Siqi [2 ,3 ]
Affiliations
[1] Shanghai Univ, Sch Comp Engn & Sci, Shanghai 200444, Peoples R China
[2] Shanghai Univ, Sch Mat Sci & Engn, State Key Lab Adv Special Steel, Shanghai 200444, Peoples R China
[3] Shanghai Univ, Mat Genome Inst, Shanghai 200444, Peoples R China
[4] Shanghai Engn Res Ctr Intelligent Comp Syst, Sch Chem, Shanghai 200444, Peoples R China
[5] Australian Nucl Sci & Technol Org, Sydney 2232, Australia
[6] Univ Sydney, Sch Chem, Sydney 2006, Australia
Funding
National Natural Science Foundation of China;
Keywords
machine learning; data governance; data quantity; materials science; DENSITY-FUNCTIONAL THEORY; FEATURE-SELECTION; PREDICTION; CLASSIFICATION; EXTRACTION; DISCOVERY; CARBON; CREEP; FCC;
DOI
10.1093/nsr/nwad125
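Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract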
This paper proposes a synergistic governance flow for materials data that incorporates domain knowledge to provide a high-quality data foundation for accelerating materials design and discovery. Data-driven machine learning (ML) is widely employed in the analysis of materials structure-activity relationships, performance optimization and materials design because of its superior ability to reveal latent data patterns and make accurate predictions. However, because materials data acquisition is laborious, ML models encounter a mismatch between a high-dimensional feature space and a small sample size (for traditional ML models) or between the number of model parameters and the sample size (for deep-learning models), which usually results in poor performance. Here, we review efforts to tackle this issue via feature reduction, sample augmentation and specific ML approaches, and show that the balance between the number of samples and the number of features or model parameters deserves close attention during data quantity governance. Following this, we propose a synergistic data quantity governance flow that incorporates materials domain knowledge. After summarizing the approaches to incorporating materials domain knowledge into the ML process, we provide examples of incorporating domain knowledge into governance schemes to demonstrate the advantages and applications of the approach. The work paves the way for obtaining the high-quality data required to accelerate ML-based materials design and discovery.
Pages: 17
Related papers
100 entries in total
[1]   Perspective: Materials informatics and big data: Realization of the "fourth paradigm" of science in materials science [J].
Agrawal, Ankit ;
Choudhary, Alok .
APL MATERIALS, 2016, 4 (05)
[2]   Exploration of data science techniques to predict fatigue strength of steel from composition and processing parameters [J].
Agrawal A. ;
Deshpande P.D. ;
Cecen A. ;
Basavarsu G.P. ;
Choudhary A.N. ;
Kalidindi S.R. .
Integrating Materials and Manufacturing Innovation, 2014, 3 (01) :90-108
[3]   Named Entity Extraction for Knowledge Graphs: A Literature Overview [J].
Al-Moslmi, Tareq ;
Ocana, Marc Gallofre ;
Opdahl, Andreas L. ;
Veres, Csaba .
IEEE ACCESS, 2020, 8 :32862-32881
[4]   Beyond Scaling Relations for the Description of Catalytic Materials [J].
Andersen, Mie ;
Levchenko, Sergey V. ;
Scheffler, Matthias ;
Reuter, Karsten .
ACS CATALYSIS, 2019, 9 (04) :2752-2759
[5]  
Aziz R, 2017, AIMS Bioengineering, V4, P179, DOI [10.3934/bioeng.2017.2.179, 10.3934/bioeng.2017.1.179]
[6]  
Bäuml B, 2019, IEEE INT CONF ROBOT, P4262, DOI 10.1109/ICRA.2019.8794021
[7]   New tolerance factor to predict the stability of perovskite oxides and halides [J].
Bartel, Christopher J. ;
Sutton, Christopher ;
Goldsmith, Bryan R. ;
Ouyang, Runhai ;
Musgrave, Charles B. ;
Ghiringhelli, Luca M. ;
Scheffler, Matthias .
SCIENCE ADVANCES, 2019, 5 (02)
[8]   Physical descriptor for the Gibbs energy of inorganic crystalline solids and temperature-dependent materials chemistry [J].
Bartel, Christopher J. ;
Millican, Samantha L. ;
Deml, Ann M. ;
Rumptz, John R. ;
Tumas, William ;
Weimer, Alan W. ;
Lany, Stephan ;
Stevanovic, Vladan ;
Musgrave, Charles B. ;
Holder, Aaron M. .
NATURE COMMUNICATIONS, 2018, 9
[9]   Active learning for accelerated design of layered materials [J].
Bassman, Lindsay ;
Rajak, Pankaj ;
Kalia, Rajiv K. ;
Nakano, Aiichiro ;
Sha, Fei ;
Sun, Jifeng ;
Singh, David J. ;
Aykol, Muratahan ;
Huck, Patrick ;
Persson, Kristin ;
Vashishta, Priya .
NPJ COMPUTATIONAL MATERIALS, 2018, 4
[10]  
Ben Veyseh AP, 2022, Arxiv, DOI arXiv:2209.04951