Supervised clustering of variables based on Gram-Schmidt transformation

Cited: 0
Authors
Liu R. [1 ]
Wang H. [1 ,2 ]
Wang S. [1 ,3 ]
Institutions
[1] School of Economics and Management, Beihang University, Beijing
[2] Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing
[3] Beijing Key Laboratory of Emergency Support Simulation Technologies for City Operations, Beijing
Source
Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics | 2019, Vol. 45, No. 10
Funding
National Natural Science Foundation of China
Keywords
Dimension reduction; Gram-Schmidt transformation; High correlation; Regression; Variable clustering;
DOI
10.13700/j.bh.1001-5965.2019.0050
Abstract
To further investigate regression-based dimension reduction for high-dimensional data, a supervised clustering of variables algorithm based on the Gram-Schmidt transformation (SCV-GS) is proposed. Unlike hierarchical clustering of variables around latent components, SCV-GS takes the key variables selected sequentially by a variable-screening procedure as the cluster centers. High correlation among variables is handled through the Gram-Schmidt transformation, from which the clustering result is obtained. In addition, drawing on the idea of partial least squares, a new "homogeneity" criterion is proposed for selecting the optimal clustering parameter. SCV-GS not only produces the variable clustering quickly, but also identifies the variable groups most relevant to the response and the structure through which those variables influence it. Simulation results show that SCV-GS is substantially faster, and that the estimated regression coefficients of the latent variables agree with those of the comparison method. Real-data analysis shows that SCV-GS performs better in both interpretation and prediction. © 2019, Editorial Board of JBUAA. All rights reserved.
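The core operation the abstract describes is decorrelating the remaining variables against sequentially selected key variables via the Gram-Schmidt transformation. The sketch below illustrates only that step, not the full SCV-GS algorithm; the function name, interface, and the choice of key indices are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gram_schmidt_residualize(X, key_idx):
    """Orthogonalize every column of X against the key variables, in order.

    X        : (n, p) data matrix, columns = variables.
    key_idx  : indices of the key variables chosen by the screening step
               (hypothetical input; the paper's screening rule is not shown here).
    Returns the residualized matrix R and the orthonormal key directions Q.
    """
    X = np.asarray(X, dtype=float)
    R = X.copy()
    basis = []
    for j in key_idx:
        v = R[:, j].copy()
        norm = np.linalg.norm(v)
        if norm < 1e-12:          # key variable already explained by earlier keys
            continue
        q = v / norm
        # Classical Gram-Schmidt step: remove the component along q
        # from every column, including the key column itself.
        R = R - np.outer(q, q @ R)
        basis.append(q)
    Q = np.column_stack(basis)
    return R, Q
```

After this transformation, each residual column is orthogonal to all processed key directions, so the remaining variables can be assigned to clusters (e.g., by their correlation with each key variable) without the distortion caused by high collinearity.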
Pages: 2003-2010 (7 pages)
Related References (23 total)
  • [1] Tibshirani R., Regression shrinkage and selection via the lasso: A retrospective, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73, 3, pp. 273-282, (2011)
  • [2] Zou H., Hastie T., Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 2, pp. 301-320, (2005)
  • [3] Fan J.Q., Lv J.C., Sure independence screening for ultrahigh dimensional feature space, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 5, pp. 849-911, (2008)
  • [4] Wang H.S., Forward regression for ultra-high dimensional variable screening, Journal of the American Statistical Association, 104, 488, pp. 1512-1524, (2009)
  • [5] Zou H., Hastie T., Tibshirani R., Sparse principal component analysis, Journal of Computational and Graphical Statistics, 15, 2, pp. 265-286, (2006)
  • [6] Chun H., Keles S., Sparse partial least squares regression for simultaneous dimension reduction and variable selection, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72, 1, pp. 3-25, (2010)
  • [7] Chen M.K., Vigneau E., Supervised clustering of variables, Advances in Data Analysis and Classification, 10, 1, pp. 85-101, (2016)
  • [8] Jolliffe I.T., Discarding variables in a principal component analysis. I: Artificial data, Applied Statistics, 21, 2, pp. 160-173, (1972)
  • [9] Hastie T., Tibshirani R., Botstein D., Et al., Supervised harvesting of expression trees, Genome Biology, 2, 1, (2001)
  • [10] Vigneau E., Qannari E., Clustering of variables around latent components, Communications in Statistics-Simulation and Computation, 32, 4, pp. 1131-1150, (2003)