Practical feature filter strategy to machine learning for small datasets in chemistry

Cited by: 0
Authors
Hu, Yang [1, 2]
Sandt, Roland [1, 2]
Spatschek, Robert [1, 2, 3]
Affiliations
[1] Forschungszentrum Julich GmbH, Inst Energy Mat Devices IMD 1, D-52428 Julich, Germany
[2] Rhein Westfal TH Aachen, Georesources & Mat Engn, D-52062 Aachen, Germany
[3] Jara Energy, D-52428 Julich, Germany
Source
SCIENTIFIC REPORTS | 2024 / Vol. 14 / Issue 01
Keywords
ENERGIES;
DOI
10.1038/s41598-024-71342-1
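Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Subject classification codes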
07 ; 0710 ; 09 ;
Abstract
Many potential use cases for machine learning in chemistry and materials science suffer from small dataset sizes, which demand special care in model design to deliver reliable predictions. Feature selection, as the key determinant of dataset design, is therefore essential. We propose a practical and efficient feature filter strategy to determine the best input feature candidates. We illustrate this strategy for the prediction of adsorption energies, based on a public dataset, and of sublimation enthalpies, using an in-house training dataset. For the adsorption energies, the strategy reduces the feature space from 12 dimensions to two while still delivering accurate results. For the sublimation enthalpies, the strategy selects three input configurations as most relevant, out of 14 possible configurations with different dimensions, for further predictions. The best extreme gradient boosting regression model performs well and is evaluated from statistical and theoretical perspectives; it reaches a level of accuracy comparable to density functional theory computations and allows for physical interpretation of the predictions. Overall, the results indicate that the feature filter strategy can help interdisciplinary scientists who lack deep AI expertise and have limited computational resources to first establish a reliable small training dataset, which may make the final machine learning model training easier and more accurate and avoid time-consuming hyperparameter exploration and improper feature selection.
Pages: 10
Related papers
50 in total
  • [1] A strategy to apply machine learning to small datasets in materials science
    Ying Zhang
    Chen Ling
    npj Computational Materials, 4
  • [2] A strategy to apply machine learning to small datasets in materials science
    Zhang, Ying
    Ling, Chen
    NPJ COMPUTATIONAL MATERIALS, 2018, 4
  • [3] A machine learning approach for corrosion small datasets
    Totok Sutojo
    Supriadi Rustad
    Muhamad Akrom
    Abdul Syukur
    Guruh Fajar Shidik
    Hermawan Kresno Dipojono
    npj Materials Degradation, 7
  • [4] A machine learning approach for corrosion small datasets
    Sutojo, Totok
    Rustad, Supriadi
    Akrom, Muhamad
    Syukur, Abdul
    Shidik, Guruh Fajar
    Dipojono, Hermawan Kresno
    NPJ MATERIALS DEGRADATION, 2023, 7 (01)
  • [5] Practical feature subset selection for machine learning
    Hall, MA
    Smith, LA
    PROCEEDINGS OF THE 21ST AUSTRALASIAN COMPUTER SCIENCE CONFERENCE, ACSC'98, 1998, 20 (01) : 181 - 191
  • [6] Machine Learning Methods with Noisy, Incomplete or Small Datasets
    Caiafa, Cesar F.
    Sun, Zhe
    Tanaka, Toshihisa
    Marti-Puig, Pere
    Sole-Casals, Jordi
    APPLIED SCIENCES-BASEL, 2021, 11 (09)
  • [7] Averaging Strategy for Interpretable Machine Learning on Small Datasets to Understand Element Uptake after Seed Nanotreatment
    Yu, Hengjie
    Tang, Shiyu
    Li, Sam Fong Yau
    Cheng, Fang
    ENVIRONMENTAL SCIENCE & TECHNOLOGY, 2023, 57 (34) : 12760 - 12770
  • [8] Curated Datasets and Feature Analysis for Phishing Email Detection with Machine Learning
    Champa, Arifa I.
    Rabbi, Md Fazle
    Zibran, Minhaz F.
    2024 IEEE 3RD INTERNATIONAL CONFERENCE ON COMPUTING AND MACHINE INTELLIGENCE, ICMI 2024, 2024
  • [9] Decomposition Methods for Machine Learning with Small, Incomplete or Noisy Datasets
    Caiafa, Cesar Federico
    Sole-Casals, Jordi
    Marti-Puig, Pere
    Zhe, Sun
    Tanaka, Toshihisa
    APPLIED SCIENCES-BASEL, 2020, 10 (23) : 1 - 20
  • [10] Simple Baseline Machine Learning Text Classifiers for Small Datasets
    Riekert M.
    Riekert M.
    Klein A.
    SN Computer Science, 2021, 2 (3)