Core-elements for large-scale least squares estimation

被引:0
|
作者
Li, Mengyu [1 ]
Yu, Jun [2 ]
Li, Tao [1 ]
Meng, Cheng [1 ]
机构
[1] Renmin Univ China, Inst Stat & Big Data, Ctr Appl Stat, Beijing 100872, Peoples R China
[2] Beijing Inst Technol, Sch Math & Stat, Beijing 100081, Peoples R China
基金
中国国家自然科学基金;
关键词
Coresets; Linear model; Sparse matrix; Subset selection; NYSTROM APPROXIMATION; K-MEANS; MATRIX; COMPUTATION;
D O I
10.1007/s11222-024-10505-6
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample and has found extensive application in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome this limitation, we develop a novel element-wise subset selection approach, called core-elements, for large-scale least squares estimation. We provide a deterministic algorithm to construct the core-elements estimator, only requiring an O(nnz(X)+rp2)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(\textrm{nnz}(X)+rp<^>2)$$\end{document} computational cost, where X is an nxp\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n\times p$$\end{document} predictor matrix, r is the number of elements selected from each column of X, and nnz(<middle dot>)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textrm{nnz}(\cdot )$$\end{document} denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method's superior performance compared to mainstream competitors.
引用
收藏
页数:16
相关论文
共 50 条
  • [21] Constrained Stochastic Gradient Descent for Large-scale Least Squares Problem
    Mu, Yang
    Ding, Wei
    Zhou, Tianyi
    Tao, Dacheng
    19TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'13), 2013, : 883 - 891
  • [22] INCREMENTAL REGULARIZED LEAST SQUARES FOR DIMENSIONALITY REDUCTION OF LARGE-SCALE DATA
    Zhang, Xiaowei
    Cheng, Li
    Chu, Delin
    Liao, Li-Zhi
    Ng, Michael K.
    Tan, Roger C. E.
    SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2016, 38 (03): : B414 - B439
  • [23] Analysis of extended partial least squares for monitoring large-scale processes
    Chen, Q
    Kruger, U
    IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, 2005, 13 (05) : 807 - 813
  • [24] Integration of the monte carlo covariance estimation strategy into tailored solution procedures for large-scale least squares problems
    Alkhatib, H.
    Schuh, W. -D.
    JOURNAL OF GEODESY, 2007, 81 (01) : 53 - 66
  • [26] Integration of the Monte Carlo Covariance Estimation Strategy into Tailored Solution Procedures for Large-Scale Least Squares Problems
    H. Alkhatib
    W. -D. Schuh
    Journal of Geodesy, 2007, 81 : 53 - 66
  • [27] LAGRANGIAN APPROACH FOR LARGE-SCALE LEAST ABSOLUTE VALUE ESTIMATION
    SKLAR, MG
    ARMSTRONG, RD
    COMPUTERS & OPERATIONS RESEARCH, 1993, 20 (01) : 83 - 93
  • [28] Large-Scale Fuzzy Least Squares Twin SVMs for Class Imbalance Learning
    Ganaie, M. A.
    Tanveer, M.
    Lin, Chin-Teng
    IEEE TRANSACTIONS ON FUZZY SYSTEMS, 2022, 30 (11) : 4815 - 4827
  • [29] NONLINEAR LEAST-SQUARES APPROACH FOR LARGE-SCALE ALGEBRAIC RICCATI EQUATIONS
    Jbilou, Khalide
    Raydan, Marcos
    SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2019, 41 (04): : A2193 - A2211
  • [30] BLOCK AOR ITERATIVE SCHEMES FOR LARGE-SCALE LEAST-SQUARES PROBLEMS
    PAPADOPOULOU, EP
    SARIDAKIS, YG
    PAPATHEODOROU, TS
    SIAM JOURNAL ON NUMERICAL ANALYSIS, 1989, 26 (03) : 637 - 660