A general framework to govern machine learning oriented materials data quality

Cited by: 1
Authors
Liu, Yue [1, 2]
Yang, Zhengwei [1, 2]
Zou, Xinxin [1, 2]
Lin, Yuxiao [6]
Ma, Shuchang [1, 2]
Zuo, Wei [1, 2]
Zou, Zheyi [5]
Wang, Hong [7, 8]
Avdeev, Maxim [9, 10]
Shi, Siqi [1, 3, 4]
Affiliations
[1] Shanghai Univ, State Key Lab Mat Adv Nucl Energy, Shanghai 200444, Peoples R China
[2] Shanghai Univ, Sch Comp Engn & Sci, Shanghai 200444, Peoples R China
[3] Shanghai Univ, Sch Mat Sci & Engn, Shanghai 200444, Peoples R China
[4] Shanghai Univ, Mat Genome Inst, Shanghai 200444, Peoples R China
[5] Xiangtan Univ, Sch Mat Sci & Engn, Xiangtan 411105, Peoples R China
[6] Jiangsu Normal Univ, Sch Phys & Elect Engn, Xuzhou 221116, Peoples R China
[7] Shanghai Jiao Tong Univ, Mat Genome Initiat Ctr, Shanghai 200240, Peoples R China
[8] Shanghai Jiao Tong Univ, Sch Mat Sci & Engn, Shanghai 200240, Peoples R China
[9] Australian Nucl Sci & Technol Org, Sydney, NSW 2232, Australia
[10] Univ Sydney, Sch Chem, Sydney, NSW 2006, Australia
Keywords
Machine learning; Materials science; Data quality; Domain knowledge; IONIC-CONDUCTIVITY; CRYSTAL-STRUCTURE; OPTIMIZATION; KNOWLEDGE; DESIGN
DOI
10.1016/j.mser.2025.101050
Chinese Library Classification (CLC) number
T [Industrial Technology];
Discipline classification code
08;
Abstract
Machine learning (ML) is increasingly applied to materials discovery and property prediction, mainly because it offers low-cost, efficient data analysis. The quality of materials data can heavily influence the performance of ML models. However, most current approaches to improving data quality are purely data-driven and neglect both materials domain knowledge and the data quality issues latent throughout the ML modelling process. Here, we address the definition of high-quality data and propose a general framework for ML-oriented MATerials Data Quality Governance incorporating domain knowledge (MAT-DQG). The framework comprises nine dimensions defining WHAT aspects of materials data quality should be evaluated, lifecycle models guiding WHEN to execute data governance activities throughout the ML modelling process, and processing models guiding HOW to detect and address materials data quality issues. Sixty datasets from materials ML studies are assembled to demonstrate the potential utility and applications of MAT-DQG, including mining complicated structure-activity relationships in metals, inorganic non-metals, polymers, and composite materials. MAT-DQG identifies and resolves issues in 17 datasets, yielding prediction accuracy improvements of up to 49%. Our work lays a foundation for governing ML-oriented materials data and ensuring its reusability and reliability, thereby advancing the frontiers of materials discovery and design.
Pages: 21