ADAPTIVE ESTIMATION IN STRUCTURED FACTOR MODELS WITH APPLICATIONS TO OVERLAPPING CLUSTERING

Cited by: 16

Authors:

Bing, Xin [1]

Bunea, Florentina [1]

Ning, Yang [1]

Wegkamp, Marten [1,2]

Affiliations:

[1] Cornell Univ, Dept Stat & Data Sci, Ithaca, NY 14853 USA

[2] Cornell Univ, Dept Math, White Hall, Ithaca, NY 14853 USA

Source:

ANNALS OF STATISTICS | 2020 / Vol. 48 / No. 04

Keywords:

Overlapping clustering; latent model; identification; high-dimensional estimation; minimax estimation; pure variables; group recovery; support recovery; sparse loading matrix; matrix factorization; adaptive estimation; COVARIANCE ESTIMATION; MATRIX; SELECTION; RANK; DECOMPOSITION; ALGORITHMS;

DOI:

10.1214/19-AOS1877

Chinese Library Classification (CLC) number:

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];

Discipline classification code:

020208 ; 070103 ; 0714 ;

Abstract:

This work introduces a novel estimation method, called LOVE, of the entries and structure of a loading matrix $A$ in a latent factor model $X = AZ + E$, for an observable random vector $X \in \mathbb{R}^p$, with correlated unobservable factors $Z \in \mathbb{R}^K$, with $K$ unknown, and uncorrelated noise $E$. Each row of $A$ is scaled, and allowed to be sparse. In order to identify the loading matrix $A$, we require the existence of pure variables, which are components of $X$ that are associated, via $A$, with one and only one latent factor. Despite the fact that the number of factors $K$, the number of the pure variables and their location are all unknown, we only require a mild condition on the covariance matrix of $Z$, and a minimum of only two pure variables per latent factor to show that $A$ is uniquely defined, up to signed permutations. Our proofs for model identifiability are constructive, and lead to our novel estimation method of the number of factors and of the set of pure variables, from a sample of size $n$ of observations on $X$. This is the first step of our LOVE algorithm, which is optimization-free, and has low computational complexity of order $p^2$. The second step of LOVE is an easily implementable linear program that estimates $A$. We prove that the resulting estimator is near minimax rate optimal for $A$, with respect to the $\|\cdot\|_{\infty,q}$ loss, for $q \ge 1$, up to logarithmic factors in $p$, and that it can be minimax-rate optimal in many cases of interest. The model structure is motivated by the problem of overlapping variable clustering, ubiquitous in data science. We define the population level clusters as groups of those components of $X$ that are associated, via the matrix $A$, with the same unobservable latent factor, and multifactor association is allowed. Clusters are respectively anchored by the pure variables, and form overlapping subgroups of the $p$-dimensional random vector $X$. The Latent model approach to OVErlapping clustering is reflected in the name of our algorithm, LOVE. The third step of LOVE estimates the clusters from the support of the columns of the estimated $A$. We guarantee cluster recovery with zero false positive proportion, and with false negative proportion control. The practical relevance of LOVE is illustrated through the analysis of an RNA-seq data set, devoted to determining the functional annotation of genes with unknown function.
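The model and the population-level cluster definition in the abstract can be illustrated with a small simulation. The sketch below is not the LOVE algorithm: it only generates data from $X = AZ + E$ for a hypothetical loading matrix and reads the pure variables and overlapping clusters directly from the known $A$. The specific matrix, the row scaling (absolute row sums at most 1), the factor covariance, the noise level, and the function names are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical loading matrix A with p = 6 variables and K = 2 factors.
# Assumption: rows are scaled so that their absolute entries sum to at most 1.
A = np.array([
    [1.0, 0.0],    # pure variable for factor 0
    [-1.0, 0.0],   # pure variable for factor 0 (negative sign allowed)
    [0.0, 1.0],    # pure variable for factor 1
    [0.0, 1.0],    # pure variable for factor 1
    [0.5, 0.5],    # associated with both factors
    [0.3, -0.7],   # associated with both factors
])


def simulate(A, n=500, noise_sd=0.5, seed=0):
    """Draw n observations of X = A Z + E with correlated factors Z."""
    rng = np.random.default_rng(seed)
    p, K = A.shape
    # Assumed factor covariance: unit variances, common correlation 0.3.
    C = np.full((K, K), 0.3) + 0.7 * np.eye(K)
    Z = rng.multivariate_normal(np.zeros(K), C, size=n)  # shape (n, K)
    E = rng.normal(scale=noise_sd, size=(n, p))          # uncorrelated noise
    return Z @ A.T + E                                   # shape (n, p)


def pure_variables(A):
    """Indices of rows of A with exactly one nonzero entry."""
    return [i for i in range(A.shape[0]) if np.count_nonzero(A[i]) == 1]


def clusters_from_support(A):
    """Cluster k is the support of column k of A; clusters may overlap."""
    return [np.flatnonzero(A[:, k]).tolist() for k in range(A.shape[1])]


X = simulate(A)
pure = pure_variables(A)
clusters = clusters_from_support(A)
```

Here variables 4 and 5 belong to both clusters, which is the overlap the abstract refers to, while each cluster is anchored by two pure variables. In practice $A$ is unknown and would have to be estimated from `X`; this sketch skips that step entirely.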

Citation

Pages: 2055-2081

Page count: 27

