A statistical view of column subset selection

Cited by: 0
Authors
Sood, Anav [1]
Hastie, Trevor [1]
Affiliations
[1] Stanford Univ, Dept Stat, Sequoia Hall,390 Jane Stanford Way, Stanford, CA 94305 USA
Funding
National Institutes of Health (USA); National Science Foundation (USA)
Keywords
column subset selection; high-dimensional statistics; interpretable dimensionality reduction; principal components analysis; principal variables; probabilistic modelling; PRINCIPAL COMPONENT ANALYSIS; REVEALING QR FACTORIZATIONS; MATRIX DECOMPOSITION; VARIABLE SELECTION; BAND SELECTION; RANK; ALGORITHMS; PERSONALITY; CLASSIFICATION; COMPUTATION;
DOI
10.1093/jrsssb/qkaf023
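Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract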
We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as column subset selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of principal variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum-likelihood estimation within a certain semi-parametric model. Within this model, we establish suitable conditions under which the CSS estimate is consistent in high dimensions, specifically in the proportional asymptotic regime where the number of variables over the sample size converges to a constant. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
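As a rough illustration of point (1), the standard CSS objective (the squared Frobenius error from regressing all columns on the selected ones) can be computed from the covariance matrix alone. For centered data with covariance $\Sigma$ and subset $S$, it equals $n \cdot \mathrm{tr}(\Sigma - \Sigma_{\cdot S} \Sigma_{SS}^{-1} \Sigma_{S \cdot})$. The sketch below is a minimal example under stated assumptions: the names `css_objective` and `greedy_css` are hypothetical, the greedy forward search is a simple heuristic swapped in for illustration and is not necessarily the authors' algorithm, and $\Sigma_{SS}$ is assumed invertible.

```python
import numpy as np


def css_objective(sigma, subset):
    """Per-sample residual variance after regressing all variables on `subset`.

    Computes tr(Sigma - Sigma[:, S] @ inv(Sigma[S, S]) @ Sigma[S, :]).
    Assumes Sigma[S, S] is invertible (an assumption of this sketch).
    """
    s = list(subset)
    sigma_ss = sigma[np.ix_(s, s)]
    cross = sigma[:, s]
    explained = np.trace(cross @ np.linalg.solve(sigma_ss, cross.T))
    return np.trace(sigma) - explained


def greedy_css(sigma, k):
    """Greedy forward selection of k columns (a heuristic, not an exact search)."""
    selected = []
    remaining = list(range(sigma.shape[0]))
    for _ in range(k):
        # Add the variable that leaves the least residual variance.
        best = min(remaining, key=lambda j: css_objective(sigma, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 6))
    x -= x.mean(axis=0)                # center the columns
    sigma = x.T @ x / x.shape[0]       # the only summary statistic used below
    print(greedy_css(sigma, 3))
```

Only `sigma` enters the selection step, so the raw data matrix is not needed once the covariance has been computed.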
Pages: 22