Improving sample and feature selection with principal covariates regression

Cited by: 25

Authors:

Cersonsky, Rose K. [1]

Helfrecht, Benjamin A. [1]

Engel, Edgar A. [2]

Kliavinek, Sergei [1]

Ceriotti, Michele [1]

Affiliations:

[1] Ecole Polytech Fed Lausanne, Lab Computat Sci & Modeling, IMX, CH-1015 Lausanne, Switzerland

[2] Univ Cambridge, Cavendish Lab, TCM Grp, JJ Thomson Ave, Cambridge CB3 0HE, England

Source:

Machine Learning: Science and Technology | 2021 / Vol. 2 / Issue 3

Keywords:

machine learning; feature selection; sample selection; farthest point sampling; materials science; physical chemistry; semi-supervised learning; singular value decomposition; rank-one modification; networks; input

DOI:

10.1088/2632-2153/abfe7c

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Selecting the most relevant features and samples out of a large set of candidates is a task that occurs very often in the context of automated data analysis, where it improves the computational performance and often the transferability of a model. Here we focus on two popular subselection schemes applied to this end: CUR decomposition, derived from a low-rank approximation of the feature matrix, and farthest point sampling (FPS), which relies on the iterative identification of the most diverse samples and discriminating features. We modify these unsupervised approaches, incorporating a supervised component following the same spirit as the principal covariates (PCov) regression method. We show how this results in selections that perform better in supervised tasks, demonstrating with models of increasing complexity, from ridge regression to kernel ridge regression and finally feed-forward neural networks. We also present adjustments to minimise the impact of any subselection when performing unsupervised tasks. We demonstrate the significant improvements associated with PCov-CUR and PCov-FPS selections for applications to chemistry and materials science, typically reducing by a factor of two the number of features and samples required to achieve a given level of regression accuracy.
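As an illustration of the unsupervised farthest point sampling that the abstract takes as its starting point, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `farthest_point_sampling`, the `first` argument, and the toy data are hypothetical, and it uses ordinary Euclidean distance only. The PCov variants described in the abstract add a supervised component, which this sketch does not attempt to reproduce.

```python
import math


def farthest_point_sampling(points, n_select, first=0):
    """Greedy FPS: repeatedly pick the point farthest from all picks so far.

    Illustrative, unsupervised sketch only (hypothetical helper).
    """
    selected = [first]
    # Distance from each point to its nearest already-selected point.
    min_dist = [math.dist(p, points[first]) for p in points]
    for _ in range(n_select - 1):
        # The next pick maximizes the distance to the current selection.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update nearest-selected distances with the new pick.
        min_dist = [
            min(d, math.dist(p, points[nxt]))
            for d, p in zip(min_dist, points)
        ]
    return selected


# Toy data (assumed for illustration): five samples with two features each.
samples = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 5.0), (0.0, 4.0)]
picked = farthest_point_sampling(samples, n_select=3)
print(picked)
```

Passing the rows of a feature matrix selects samples; passing its columns (the transposed matrix) would select features in the same way.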

Pages: 16

References: 65 in total
