Streaming and Distributed Algorithms for Robust Column Subset Selection

被引:0
作者
Jiang, Shuli [1 ]
Li, Dongyu [1 ]
Li, Irene Mengze [1 ]
Mahankali, Arvind, V [1 ]
Woodruff, David P. [1 ]
机构
[1] Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA
来源
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139 | 2021年 / 139卷
关键词
MATRIX; APPROXIMATION;
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We give the first single-pass streaming algorithm for Column Subset Selection with respect to the entrywise l(p)-norm with 1 <= p < 2. We study the l(p) norm loss since it is often considered more robust to noise than the standard Frobenius norm. Given an input matrix A is an element of R-dxn (n >> d), our algorithm achieves a multiplicative k(1/p-1/2)-poly(log nd)-approximation to the error with respect to the best possible column subset of size k. Furthermore, the space complexity of the streaming algorithm is optimal up to a logarithmic factor. Our streaming algorithm also extends naturally to a 1-round distributed protocol with nearly optimal communication cost. A key ingredient in our algorithms is a reduction to column subset selection in the l(p,2)-norm, which corresponds to the p-norm of the vector of Euclidean norms of each of the columns of A. This enables us to leverage strong coreset constructions for the Euclidean norm, which previously had not been applied in this context. We also give the first provable guarantees for greedy column subset selection in the l(1,2) norm, which can be used as an alternative, practical subroutine in our algorithms. Finally, we show that our algorithms give significant practical advantages on real-world data analysis tasks.
引用
收藏
页数:11
相关论文
共 40 条
  • [1] Altschuler J, 2016, PR MACH LEARN RES, V48
  • [2] [Anonymous], 2015, CORR
  • [3] Balcan M., 2015, CORR
  • [4] Communication Efficient Distributed Kernel Principal Component Analysis
    Balcan, Maria-Florina
    Liang, Yingyu
    Song, Le
    Woodruff, David
    Xie, Bo
    [J]. KDD'16: PROCEEDINGS OF THE 22ND ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2016, : 725 - 734
  • [5] Balcan Maria-Florina, 2014, Advances in Neural Information Processing Systems, V27
  • [6] Ban F, 2019, Disc Algorithms, P747
  • [7] APPROXIMATION OF ZONOIDS BY ZONOTOPES
    BOURGAIN, J
    LINDENSTRAUSS, J
    MILMAN, V
    [J]. ACTA MATHEMATICA, 1989, 162 (1-2) : 73 - 141
  • [8] Boutsidis C., 2008, 14 ACM SGKDD INT C K, P61
  • [9] Boutsidis C., 2008, ARXIV08124293
  • [10] Optimal Principal Component Analysis in Distributed and Streaming Models
    Boutsidis, Christos
    Woodruff, David P.
    Zhong, Peilin
    [J]. STOC'16: PROCEEDINGS OF THE 48TH ANNUAL ACM SIGACT SYMPOSIUM ON THEORY OF COMPUTING, 2016, : 236 - 249