Improving sample and feature selection with principal covariates regression

Cited by: 25
Authors
Cersonsky, Rose K. [1 ]
Helfrecht, Benjamin A. [1 ]
Engel, Edgar A. [2 ]
Kliavinek, Sergei [1 ]
Ceriotti, Michele [1 ]
Affiliations
[1] Ecole Polytech Fed Lausanne, Lab Computat Sci & Modeling, IMX, CH-1015 Lausanne, Switzerland
[2] Univ Cambridge, Cavendish Lab, TCM Grp, JJ Thomson Ave, Cambridge CB3 0HE, England
Source
MACHINE LEARNING-SCIENCE AND TECHNOLOGY | 2021 / Volume 2 / Issue 03
Keywords
machine learning; feature selection; sample selection; farthest point sampling; materials science; physical chemistry; semi-supervised learning; SINGULAR VALUE DECOMPOSITION; RANK-ONE MODIFICATION; NETWORKS; INPUT;
DOI
10.1088/2632-2153/abfe7c
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Selecting the most relevant features and samples out of a large set of candidates is a task that occurs very often in the context of automated data analysis, where it improves the computational performance and often the transferability of a model. Here we focus on two popular subselection schemes applied to this end: CUR decomposition, derived from a low-rank approximation of the feature matrix, and farthest point sampling (FPS), which relies on the iterative identification of the most diverse samples and discriminating features. We modify these unsupervised approaches, incorporating a supervised component following the same spirit as the principal covariates (PCov) regression method. We show how this results in selections that perform better in supervised tasks, demonstrating with models of increasing complexity, from ridge regression to kernel ridge regression and finally feed-forward neural networks. We also present adjustments to minimise the impact of any subselection when performing unsupervised tasks. We demonstrate the significant improvements associated with PCov-CUR and PCov-FPS selections for applications to chemistry and materials science, typically reducing by a factor of two the number of features and samples required to achieve a given level of regression accuracy.
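As an illustration of the farthest point sampling (FPS) idea mentioned in the abstract, the following is a minimal sketch of the plain, unsupervised variant applied to samples. It is not the authors' implementation and does not include the PCov modification; the function name `farthest_point_sampling`, the `start` parameter, and the choice of Euclidean distance are assumptions made for this example.

```python
import numpy as np


def farthest_point_sampling(X, n_select, start=0):
    """Greedy unsupervised FPS over the rows of X (illustrative sketch only).

    X        : (n_samples, n_features) array
    n_select : number of rows to pick
    start    : index of the first selected row (assumed, arbitrary choice)
    """
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    if not 0 < n_select <= n_samples:
        raise ValueError("n_select must be between 1 and the number of samples")

    selected = [start]
    # Squared Euclidean distance from every row to its nearest selected row.
    min_dist = np.sum((X - X[start]) ** 2, axis=1)

    for _ in range(n_select - 1):
        # Pick the row that is farthest from everything selected so far.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Update the nearest-selected distances with the new pick.
        new_dist = np.sum((X - X[nxt]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, new_dist)

    return selected
```

Applying the same routine to the transpose of X would select features rather than samples. The supervised PCov-FPS variant described in the abstract changes the quantity on which distances are computed, which this sketch does not attempt to reproduce.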
Pages: 16