Efficient Model-Free Subsampling Method for Massive Data

Cited by: 2

Authors:

Zhou, Zheng [1]

Yang, Zebin [2]

Zhang, Aijun [2]

Zhou, Yongdao [1,3]

Affiliations:

[1] Nankai Univ, Sch Stat & Data Sci, NITFID, Tianjin, Peoples R China

[2] Univ Hong Kong, Dept Stat & Actuarial Sci, Hong Kong, Peoples R China

[3] Nankai Univ, Sch Stat & Data Sci, NITFID, Tianjin 300071, Peoples R China

Source:

TECHNOMETRICS | 2024, Vol. 66, No. 2

Funding:

National Natural Science Foundation of China;

Keywords:

Big data subsampling; Model robustness; Parallel computing; Uniform designs; VARIANCE TEST; DISCREPANCY;

DOI:

10.1080/00401706.2023.2271091

Chinese Library Classification:

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];

Subject classification codes:

020208 ; 070103 ; 0714 ;

Abstract:

Subsampling plays a crucial role in tackling problems associated with the storage and statistical learning of massive datasets. However, most existing subsampling methods are model-based, which means their performances can drop significantly when the underlying model is misspecified. Such an issue calls for model-free subsampling methods that are robust under diverse model specifications. Recently, several model-free subsampling methods have been developed. However, the computing time of these methods grows explosively with the sample size, making them impractical for handling massive data. In this article, an efficient model-free subsampling method is proposed, which segments the original data into some regular data blocks and obtains subsamples from each data block by the data-driven subsampling method. Compared with existing model-free subsampling methods, the proposed method has a significant speed advantage and performs more robustly for datasets with complex underlying distributions. As demonstrated in simulation experiments, the proposed method is an order of magnitude faster than other commonly used model-free subsampling methods when the sample size of the original dataset reaches the order of 10^7. Moreover, simulation experiments and case studies show that the proposed method is more robust than other model-free subsampling methods under diverse model specifications and subsample sizes.
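
The block-then-subsample idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say how the blocks are formed or which data-driven subsampling method is applied inside each block. Here the blocks are assumed to be equal-width grid cells, the per-block step is plain random sampling used as a stand-in, and the name block_subsample and its parameters (bins_per_dim, seed) are hypothetical.

```python
import random
from collections import defaultdict


def block_subsample(points, n_sub, bins_per_dim=4, seed=0):
    """Return sorted indices of a subsample drawn block by block.

    Assumptions (not from the paper): blocks are equal-width grid cells,
    and each block is subsampled by simple random sampling, a stand-in
    for the unspecified data-driven subsampling step.
    """
    n = len(points)
    if not 0 < n_sub <= n:
        raise ValueError("n_sub must be between 1 and len(points)")
    rng = random.Random(seed)
    dim = len(points[0])
    lows = [min(p[j] for p in points) for j in range(dim)]
    highs = [max(p[j] for p in points) for j in range(dim)]

    def cell(p):
        # Map a point to the integer coordinates of its grid cell.
        idx = []
        for j in range(dim):
            width = (highs[j] - lows[j]) or 1.0  # guard constant columns
            k = int((p[j] - lows[j]) / width * bins_per_dim)
            idx.append(min(k, bins_per_dim - 1))  # max value goes in last bin
        return tuple(idx)

    # Step 1: segment the data into regular blocks.
    blocks = defaultdict(list)
    for i, p in enumerate(points):
        blocks[cell(p)].append(i)

    # Step 2: split n_sub across blocks in proportion to block size,
    # using integer floors plus the largest-remainder rule.
    keys = sorted(blocks)
    sizes = {k: len(blocks[k]) for k in keys}
    alloc = {k: n_sub * sizes[k] // n for k in keys}
    leftover = n_sub - sum(alloc.values())
    by_rem = sorted(keys, key=lambda k: (n_sub * sizes[k]) % n, reverse=True)
    for k in by_rem[:leftover]:
        alloc[k] += 1

    # Step 3: subsample inside each block. Blocks are independent,
    # so this loop could be distributed across workers.
    chosen = []
    for k in keys:
        chosen.extend(rng.sample(blocks[k], alloc[k]))
    return sorted(chosen)
```

Because each block is processed independently of the others, the per-block step is the natural place for the parallel computing listed among the keywords. The proportional allocation is one possible choice, made here only to keep the sketch short.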

Pages: 240-252

Number of pages: 13

50 records in total

[21] Efficient model-free deconvolution of measured femtosecond kinetic data using a genetic algorithm
Keszei, Ernoe
JOURNAL OF CHEMOMETRICS, 2009, 23 (3-4) : 188 - 196
[22] A Robust Model-Free Feature Screening Method for Ultrahigh-Dimensional Data
Xue, Jingnan
Liang, Faming
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS, 2017, 26 (04) : 803 - 813
[23] Deterministic subsampling for logistic regression with massive data
Song, Yan
Dai, Wenlin
COMPUTATIONAL STATISTICS, 2024, 39 (02) : 709 - 732
[24] Optimal subsampling for modal regression in massive data
Chao, Yue
Huang, Lei
Ma, Xuejun
Sun, Jiajun
METRIKA, 2024, 87 (04) : 379 - 409
[25] Optimal subsampling for multiplicative regression with massive data
Wang, Tianzhen
Zhang, Haixiang
STATISTICA NEERLANDICA, 2022, 76 (04) : 418 - 449
[26] Optimal subsampling for modal regression in massive data
Yue Chao
Lei Huang
Xuejun Ma
Jiajun Sun
Metrika, 2024, 87 : 379 - 409
[27] Model-free deconvolution of femtosecond kinetic data
Banyasz, Akos
Keszei, Erno
JOURNAL OF PHYSICAL CHEMISTRY A, 2006, 110 (19) : 6192 - 6207
[28] Assessment of model-free data in corneal topography
Jongsma, FHM
DeBrabander, J
Stultiens, BAT
Hendrikse, F
VISION RESEARCH, 1996, 36 : 85 - 85
[29] A toolbox for model-free analysis of fMRI data
Gruber, P.
Kohler, C.
Theis, F. J.
INDEPENDENT COMPONENT ANALYSIS AND SIGNAL SEPARATION, PROCEEDINGS, 2007, 4666 : 209 - +
[30] Deterministic subsampling for logistic regression with massive data
Yan Song
Wenlin Dai
Computational Statistics, 2024, 39 : 709 - 732
