Big Data Model Building Using Dimension Reduction and Sample Selection

Cited by: 0
Authors
Deng, Lih-Yuan [1]
Yang, Ching-Chi [1]
Bowman, Dale [1]
Lin, Dennis K. J. [2]
Lu, Henry Horng-Shing [3,4]
Affiliations
[1] Univ Memphis, Dept Math Sci, Memphis, TN USA
[2] Purdue Univ, Dept Stat, W Lafayette, IN USA
[3] Natl Yang Ming Chiao Tung Univ, Inst Stat, Hsinchu, Taiwan
[4] Cornell Univ, Dept Stat & Data Sci, Ithaca, NY 14850 USA
Funding
U.S. National Science Foundation
关键词
Dimension reduction; GAM; IBOSS; Space-filling design;
DOI
10.1080/10618600.2023.2260052
Chinese Library Classification (CLC)
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject Classification Codes
020208; 070103; 0714
Abstract
It is difficult to handle the extraordinary data volume generated in many fields with current computational resources and techniques, which makes applying conventional statistical methods to big data very challenging. A common approach is to partition the full data into smaller subdata for purposes such as training, testing, and validation. The primary purpose of the training data is to represent the full data, so the selection of training subdata is pivotal for retaining the essential characteristics of the full data. Recently, several procedures have been proposed to select "optimal design points" as training subdata under pre-specified models, such as linear regression and logistic regression. However, such subdata are no longer "optimal" if the assumed model is inappropriate, and they are of little use for building alternative models because they are not a representative sample of the full data. In this article, we propose a novel algorithm for better model building and prediction via a process of selecting a "good" training sample. The proposed subdata retain most characteristics of the original big data. They are also more robust, in that one can fit various response models and select the optimal one. Supplementary materials for this article are available online.
Pages: 435-447
Number of pages: 13
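
As a rough illustration of the workflow described in the abstract, the following Python snippet is a minimal, hypothetical sketch (not the authors' published algorithm): it reduces the predictors with PCA, selects a space-filling-style training subsample in the reduced space via k-means centers, and fits a simple response model on the selected subdata. All data sizes, the k-means-based selection rule, and the linear stand-in model are illustrative assumptions.

```python
# A minimal, hypothetical sketch of the general idea only (dimension reduction
# followed by a space-filling-style sample selection), not the authors'
# algorithm; all sizes and model choices below are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical "big" dataset: n observations, p predictors, one response.
n, p = 20_000, 20
X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)

# Step 1: dimension reduction -- project the predictors onto leading
# principal components.
Z = PCA(n_components=5).fit_transform(X)

# Step 2: sample selection -- choose training points that cover the reduced
# space: take the observation nearest each k-means center, which spreads the
# selected subdata over the data cloud.
k = 500  # target training subdata size
centers = KMeans(n_clusters=k, n_init=1, random_state=0).fit(Z).cluster_centers_
idx = np.unique(pairwise_distances_argmin(centers, Z))

# Step 3: fit a candidate response model on the selected subdata only and
# check how well it generalizes to the full data; a linear model stands in
# here for richer choices such as a GAM.
model = LinearRegression().fit(X[idx], y[idx])
print(f"subdata size: {idx.size}, R^2 on full data: {model.score(X, y):.3f}")
```

In practice one would fit several candidate response models (for example, a GAM versus a linear model) on such subdata and select the best performer, which is the robustness property the abstract emphasizes.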