Sample size determination for multidimensional parameters and the A-optimal subsampling in a big data linear regression model

Cited by: 0
Authors
Zhang, Sheng [1 ]
Tan, Fei [1 ]
Peng, Hanxiang [1 ]
Affiliations
[1] Indiana Univ Indianapolis, Dept Math Sci, 402 N Blackford St LD 270, Indianapolis, IN 46202 USA
Keywords
Asymptotic normality; A-optimality; big data; least squares estimate; sample size determination; APPROXIMATION;
DOI
10.1080/00949655.2024.2434669
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
To efficiently approximate the least squares estimator (LSE) in a Big Data linear regression model using a subsampling approach, optimal sampling distributions were derived by minimizing the trace norm of the covariance matrix of a smooth function of the subsampling LSE. An algorithm was developed that significantly reduces the computation time for the subsampling LSE compared to the full-sample LSE. Additionally, the subsampling LSE was shown to be asymptotically normal almost surely for an arbitrary sampling distribution under suitable conditions. Motivated by the need for subsampling in Big Data analysis and data splitting in machine learning, we investigated sample size determination (SSD) for multidimensional parameters and derived analytical formulas for calculating sample sizes. Through extensive simulations and real-world data applications, we assessed the numerical properties of both the subsampling approach and the SSD methodology. Our findings revealed that the A-optimal subsampling method significantly outperformed uniform and leverage-score subsampling techniques. Furthermore, the algorithm considerably reduced the computational time required for implementing the full-sample LSE. Additionally, the SSD provided a theoretical basis for selecting sample sizes.
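As a rough illustration of the subsampling LSE idea described in the abstract, the following Python sketch draws rows with replacement from a given sampling distribution and solves an inverse-probability-weighted least squares problem on the subsample. It covers only the two baseline distributions the abstract names (uniform and leverage-score); the paper's A-optimal distribution is not reproduced here because its formula does not appear in this record. The function names `leverage_scores` and `subsample_lse`, the simulated data, and the subsample size are all assumptions made for illustration.

```python
import numpy as np


def leverage_scores(X):
    """Leverage scores h_i = ||Q_i||^2 from a reduced QR factorization of X."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)


def subsample_lse(X, y, r, probs, rng):
    """Weighted LSE on r rows drawn with replacement according to probs.

    Hypothetical helper: each sampled row is weighted by 1 / (r * probs[i]),
    the usual inverse-probability weighting for sampling with replacement.
    """
    n = X.shape[0]
    idx = rng.choice(n, size=r, replace=True, p=probs)
    sqrt_w = np.sqrt(1.0 / (r * probs[idx]))
    # Scaling rows by sqrt(weight) turns weighted LS into ordinary LS.
    beta, *_ = np.linalg.lstsq(X[idx] * sqrt_w[:, None], y[idx] * sqrt_w, rcond=None)
    return beta


# Simulated data (assumed for the example, not taken from the paper).
rng = np.random.default_rng(0)
n, p, r = 20000, 5, 1000
X = rng.standard_normal((n, p))
beta_true = np.arange(1, p + 1, dtype=float)
y = X @ beta_true + rng.standard_normal(n)

# Full-sample LSE as the target being approximated.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Uniform sampling distribution.
uniform_probs = np.full(n, 1.0 / n)
beta_unif = subsample_lse(X, y, r, uniform_probs, rng)

# Leverage-score sampling distribution.
lev = leverage_scores(X)
lev_probs = lev / lev.sum()
beta_lev = subsample_lse(X, y, r, lev_probs, rng)

print("uniform  error:", np.linalg.norm(beta_unif - beta_full))
print("leverage error:", np.linalg.norm(beta_lev - beta_full))
```

Any other sampling distribution, including an A-optimal one, could be passed through the same `probs` argument; only the construction of the probabilities would change.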
Pages: 628-653
Number of pages: 26