Core-elements for large-scale least squares estimation

Cited: 0
Authors
Li, Mengyu [1 ]
Yu, Jun [2 ]
Li, Tao [1 ]
Meng, Cheng [1 ]
Affiliations
[1] Renmin Univ China, Inst Stat & Big Data, Ctr Appl Stat, Beijing 100872, Peoples R China
[2] Beijing Inst Technol, Sch Math & Stat, Beijing 100081, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Coresets; Linear model; Sparse matrix; Subset selection; NYSTROM APPROXIMATION; K-MEANS; MATRIX; COMPUTATION;
DOI
10.1007/s11222-024-10505-6
Chinese Library Classification
TP301 [Theory, Methods];
Discipline code
081202;
Abstract
The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample and has found extensive application in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome this limitation, we develop a novel element-wise subset selection approach, called core-elements, for large-scale least squares estimation. We provide a deterministic algorithm to construct the core-elements estimator, requiring only an O(nnz(X) + rp^2) computational cost, where X is an n × p predictor matrix, r is the number of elements selected from each column of X, and nnz(·) denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator.
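To make the element-wise idea concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: it keeps the r largest-magnitude entries in each column of X, zeros out the remaining entries, and solves least squares on the sparsified surrogate. The function name `core_elements_ls` and the selection rule are assumptions made for illustration only.

```python
import numpy as np

def core_elements_ls(X, y, r):
    """Keep the r largest-magnitude entries in each column of X,
    zero out the rest, and solve least squares on the sparse surrogate.
    (Hypothetical simplification of an element-wise selection scheme.)"""
    n, p = X.shape
    X_sub = np.zeros_like(X, dtype=float)
    for j in range(p):
        keep = np.argsort(np.abs(X[:, j]))[-r:]  # indices of the r largest |x_ij|
        X_sub[keep, j] = X[keep, j]
    beta, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    return beta

# Toy usage: a dense 200 x 3 design with a known coefficient vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.standard_normal(200)
beta_hat = core_elements_ls(X, y, r=150)
```

Because only r entries per column survive, the surrogate matrix has at most rp non-zeros, which is the source of the nnz-type savings the abstract describes when X is sparse or numerically sparse.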
To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method's superior performance compared to mainstream competitors.
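The median-of-means idea mentioned above can be sketched in a generic (non-paper-specific) form: split the rows into k blocks, fit least squares on each block, and combine the block estimates with a coordinate-wise median so that a few grossly corrupted observations cannot drag the final estimate. All names and the block-splitting rule here are illustrative assumptions.

```python
import numpy as np

def mom_ls(X, y, k):
    """Median-of-means least squares (generic sketch): fit each of k row
    blocks separately, then take the coordinate-wise median of the
    per-block estimates to resist outliers."""
    n = X.shape[0]
    blocks = np.array_split(np.arange(n), k)
    betas = []
    for idx in blocks:
        b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        betas.append(b)
    return np.median(np.vstack(betas), axis=0)

# Toy usage: a few gross outliers corrupt one block, but the
# coordinate-wise median of the 10 block estimates ignores them.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 2))
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + 0.1 * rng.standard_normal(300)
y[:5] += 50.0  # inject outliers into the first block
beta_hat = mom_ls(X, y, k=10)
```

The design choice is standard for median-of-means estimators: the mean (here, an ordinary least squares fit) is efficient within each clean block, while the median across blocks caps the influence of any single contaminated block.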
Pages: 16
Related papers (50 total)
  • [41] Discharge prediction in partly vegetated channel flows: Adaptation of IDCM method with a curved interface and large-scale roughness elements
    Ben Meftah, Mouldi
    Mossa, Michele
    JOURNAL OF HYDROLOGY, 2023, 616
  • [42] The effects of large-scale fragmentation on bryophytes in temperate forests
    Pharo, EJ
    Lindenmayer, DB
    Taws, N
    JOURNAL OF APPLIED ECOLOGY, 2004, 41 (05) : 910 - 921
  • [43] A SAMPLING APPROXIMATION FOR LARGE-SCALE K-MEANS
    Phoungphol, Piyaphol
    ICAART: PROCEEDINGS OF THE 4TH INTERNATIONAL CONFERENCE ON AGENTS AND ARTIFICIAL INTELLIGENCE, VOL 1, 2012, : 324 - 327
  • [44] A review of Nystrom methods for large-scale machine learning
    Sun, Shiliang
    Zhao, Jing
    Zhu, Jiang
    INFORMATION FUSION, 2015, 26 : 36 - 48
  • [45] Piezotronic materials and large-scale piezotronics array devices
    Hu, Weiguo
    Kalantar-Zadeh, Kourosh
    Gupta, Kapil
    Liu, Chuan-Pu
    MRS BULLETIN, 2018, 43 (12) : 936 - 940
  • [46] Chaotic dynamics of large-scale structures in a turbulent wake
    Varon, Eliott
    Eulalie, Yoann
    Edwige, Stephie
    Gilotte, Philippe
    Aider, Jean-Luc
    PHYSICAL REVIEW FLUIDS, 2017, 2 (03):
  • [47] Outsourcing Large-Scale Quadratic Programming to a Public Cloud
    Zhou, Lifeng
    Li, Chunguang
    IEEE ACCESS, 2015, 3 : 2581 - 2589
  • [48] Scalable k-means for large-scale clustering
    Ming, Yuewei
    Zhu, En
    Wang, Mao
    Liu, Qiang
    Liu, Xinwang
    Yin, Jianping
    INTELLIGENT DATA ANALYSIS, 2019, 23 (04) : 825 - 838
  • [49] Large-Scale System Identification Using a Randomized SVD
    Wang, Han
    Anderson, James
    2022 AMERICAN CONTROL CONFERENCE, ACC, 2022, : 2178 - 2185
  • [50] Hardware Acceleration of Large-Scale CMOS Invertible Logic Based on Sparse Hamiltonian Matrices
    Onizawa, Naoya
    Tamakoshi, Akira
    Hanyu, Takahiro
    IEEE OPEN JOURNAL OF CIRCUITS AND SYSTEMS, 2021, 2 : 782 - 791