Core-elements for large-scale least squares estimation

被引：0

作者：

Li, Mengyu ^{[1
]}

Yu, Jun ^{[2
]}

Li, Tao ^{[1
]}

Meng, Cheng ^{[1
]}

机构：

[1] Renmin Univ China, Inst Stat & Big Data, Ctr Appl Stat, Beijing 100872, Peoples R China

[2] Beijing Inst Technol, Sch Math & Stat, Beijing 100081, Peoples R China

来源：

STATISTICS AND COMPUTING | 2024年 / 34卷 / 06期

基金：

中国国家自然科学基金;

关键词：

Coresets; Linear model; Sparse matrix; Subset selection; NYSTROM APPROXIMATION; K-MEANS; MATRIX; COMPUTATION;

D O I：

10.1007/s11222-024-10505-6

中图分类号：

TP301 [理论、方法];

学科分类号：

081202 ;

摘要：

The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample and has found extensive application in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome this limitation, we develop a novel element-wise subset selection approach, called core-elements, for large-scale least squares estimation. We provide a deterministic algorithm to construct the core-elements estimator, only requiring an O(nnz(X)+rp2)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(\textrm{nnz}(X)+rp<^>2)$$\end{document} computational cost, where X is an nxp\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n\times p$$\end{document} predictor matrix, r is the number of elements selected from each column of X, and nnz(<middle dot>)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textrm{nnz}(\cdot )$$\end{document} denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method's superior performance compared to mainstream competitors.

引用

页数：16

共 50 条

[21] Constrained Stochastic Gradient Descent for Large-scale Least Squares Problem
Mu, Yang
Ding, Wei
Zhou, Tianyi
Tao, Dacheng
19TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'13), 2013, : 883 - 891
[22] INCREMENTAL REGULARIZED LEAST SQUARES FOR DIMENSIONALITY REDUCTION OF LARGE-SCALE DATA
Zhang, Xiaowei
Cheng, Li
Chu, Delin
Liao, Li-Zhi
Ng, Michael K.
Tan, Roger C. E.
SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2016, 38 (03): : B414 - B439
[23] Analysis of extended partial least squares for monitoring large-scale processes
Chen, Q
Kruger, U
IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, 2005, 13 (05) : 807 - 813
[24] Integration of the monte carlo covariance estimation strategy into tailored solution procedures for large-scale least squares problems
Alkhatib, H.
Schuh, W. -D.
JOURNAL OF GEODESY, 2007, 81 (01) : 53 - 66
[25] APPLICABILITY OF LEAST-SQUARES TO THE PARAMETER-ESTIMATION OF LARGE-SCALE NO-MEMORY LINEAR COMPOSITE SYSTEMS
HASIEWICZ, Z
INTERNATIONAL JOURNAL OF SYSTEMS SCIENCE, 1989, 20 (12) : 2427 - 2449
[26] Integration of the Monte Carlo Covariance Estimation Strategy into Tailored Solution Procedures for Large-Scale Least Squares Problems
H. Alkhatib
W. -D. Schuh
Journal of Geodesy, 2007, 81 : 53 - 66
[27] LAGRANGIAN APPROACH FOR LARGE-SCALE LEAST ABSOLUTE VALUE ESTIMATION
SKLAR, MG
ARMSTRONG, RD
COMPUTERS & OPERATIONS RESEARCH, 1993, 20 (01) : 83 - 93
[28] Large-Scale Fuzzy Least Squares Twin SVMs for Class Imbalance Learning
Ganaie, M. A.
Tanveer, M.
Lin, Chin-Teng
IEEE TRANSACTIONS ON FUZZY SYSTEMS, 2022, 30 (11) : 4815 - 4827
[29] NONLINEAR LEAST-SQUARES APPROACH FOR LARGE-SCALE ALGEBRAIC RICCATI EQUATIONS
Jbilou, Khalide
Raydan, Marcos
SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2019, 41 (04): : A2193 - A2211
[30] BLOCK AOR ITERATIVE SCHEMES FOR LARGE-SCALE LEAST-SQUARES PROBLEMS
PAPADOPOULOU, EP
SARIDAKIS, YG
PAPATHEODOROU, TS
SIAM JOURNAL ON NUMERICAL ANALYSIS, 1989, 26 (03) : 637 - 660

← 1 2 3 4 5 →