Two-Stage Procedures for High-Dimensional Data

Cited by: 55
Authors
Aoshima, Makoto [1]
Yata, Kazuyoshi [1]
Affiliations
[1] Univ Tsukuba, Inst Math, Tsukuba, Ibaraki 3058571, Japan
Source
SEQUENTIAL ANALYSIS-DESIGN METHODS AND APPLICATIONS | 2011 / Vol. 30 / Issue 04
Funding
Japan Society for the Promotion of Science;
Keywords
Asymptotic normality; Classification; Confidence region; HDLSS; Lasso; Pathway analysis; Regression; Sample size determination; Testing equality of covariance matrices; Two-sample test; Variable selection; SAMPLE-SIZE DATA; GENE-EXPRESSION; GEOMETRIC REPRESENTATION; COVARIANCE MATRICES; LARGEST EIGENVALUE; PCA CONSISTENCY; DISCRIMINATION; CLASSIFICATION; CELL;
DOI
10.1080/07474946.2011.619088
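Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Subject classification codes
020208; 070103; 0714;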
Abstract
In this article, we consider a variety of inference problems for high-dimensional data. The purpose of this article is to suggest directions for future research and possible solutions about p >> n problems by using new types of two-stage estimation methodologies. This is the first attempt to apply sequential analysis to high-dimensional statistical inference ensuring prespecified accuracy. We offer the sample size determination for inference problems by creating new types of multivariate two-stage procedures. To develop theory and methodologies, the most important and basic idea is the asymptotic normality when p -> infinity. By developing asymptotic normality when p -> infinity, we first give (a) a given-bandwidth confidence region for the square loss. In addition, we give (b) a two-sample test to assure prespecified size and power simultaneously together with (c) an equality-test procedure for two covariance matrices. We also give (d) a two-stage discriminant procedure that controls misclassification rates being no more than a prespecified value. Moreover, we propose (e) a two-stage variable selection procedure that provides screening of variables in the first stage and selects a significant set of associated variables from among a set of candidate variables in the second stage. Following the variable selection procedure, we consider (f) variable selection for high-dimensional regression to compare favorably with the lasso in terms of the assurance of accuracy and the computational cost. Further, we consider variable selection for classification and propose (g) a two-stage discriminant procedure after screening some variables. Finally, we consider (h) pathway analysis for high-dimensional data by constructing a multiple test of correlation coefficients.
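As a rough illustration of the two-stage idea behind sample size determination with prespecified accuracy, the sketch below uses a classical univariate Stein-type scheme: a pilot sample estimates the variance, which then fixes the total sample size for a confidence interval of given half-width. It is not one of the article's high-dimensional procedures. The function names, the use of a normal quantile in place of a Student t quantile, and the simulated data are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, variance


def two_stage_sample_size(pilot, half_width, confidence=0.95):
    """First stage: turn a pilot sample into a total sample size.

    Hypothetical helper. Uses a normal quantile (a simplification;
    Stein's original procedure uses a Student t quantile).
    """
    m = len(pilot)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    s2 = variance(pilot)  # unbiased sample variance of the pilot data
    n_required = math.ceil(z * z * s2 / half_width ** 2)
    # Never return fewer observations than were already collected.
    return max(m, n_required)


def two_stage_interval(draw, pilot_size, half_width, confidence=0.95):
    """Run both stages and return (total sample size, interval)."""
    pilot = [draw() for _ in range(pilot_size)]
    n_total = two_stage_sample_size(pilot, half_width, confidence)
    # Second stage: collect only the additional observations needed.
    extra = [draw() for _ in range(n_total - pilot_size)]
    sample = pilot + extra
    center = sum(sample) / len(sample)
    return n_total, (center - half_width, center + half_width)


# Simulated data: the mean 5.0 and standard deviation 2.0 are arbitrary.
rng = random.Random(0)
n_total, interval = two_stage_interval(
    lambda: rng.gauss(5.0, 2.0), pilot_size=20, half_width=0.5
)
print(n_total, interval)
```

The interval always has the requested width, and the data-dependent total sample size is what aims at the target coverage. The article pursues the same kind of accuracy guarantee in the p >> n setting through asymptotic normality as p -> infinity.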
Pages: 356-399
Number of pages: 44