Optimal subsample selection for massive logistic regression with distributed data

Cited by: 17
Authors
Zuo, Lulu [1]
Zhang, Haixiang [1]
Wang, HaiYing [2]
Sun, Liuquan [3]
Affiliations
[1] Tianjin Univ, Ctr Appl Math, Tianjin 300072, Peoples R China
[2] Univ Connecticut, Dept Stat, Mansfield, CT 06269 USA
[3] Chinese Acad Sci, Acad Math & Syst Sci, Beijing 100190, Peoples R China
Funding
National Science Foundation (USA); National Natural Science Foundation of China;
Keywords
Allocation size; Big data; Distributed and massive data; Subsample estimator; Subsampling probabilities; FRAMEWORK; INFERENCE;
DOI
10.1007/s00180-021-01089-0
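Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification code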
020208 ; 070103 ; 0714 ;
Abstract
With the emergence of big data, it is increasingly common for data to be distributed, i.e., stored at many distributed sites (machines or nodes) owing to data collection, business operations, and similar factors. We propose a distributed subsampling procedure for this setting to efficiently approximate the maximum likelihood estimator for logistic regression. We establish the consistency and asymptotic normality of the subsample estimator given the full data. The optimal subsampling probabilities and optimal allocation sizes are obtained explicitly. We develop a two-step algorithm to approximate the optimal subsampling procedure. Numerical simulations and an application to airline data are presented to evaluate the performance of our subsampling method.
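The two-step idea summarized in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the record gives no formulas, so the sketch assumes sampling scores of the form |y_i - p_i| * ||x_i|| (a choice familiar from the general optimal-subsampling literature for logistic regression) and assumes each site's allocation is proportional to its total score. The function names `weighted_logistic_mle` and `two_step_subsample`, the sizes `r0` and `r`, and the simulated data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_logistic_mle(X, y, w, n_iter=50, tol=1e-8):
    """Newton-Raphson for a weighted logistic log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def two_step_subsample(sites, r0, r, rng):
    """sites: list of (X_k, y_k) pairs, one pair per site."""
    n_total = sum(len(y) for _, y in sites)

    # Step 1: uniform pilot subsample, split across sites by site size,
    # used only to get a rough estimate beta0.
    pilot_X, pilot_y = [], []
    for X, y in sites:
        m = max(1, int(round(r0 * len(y) / n_total)))
        idx = rng.choice(len(y), size=m, replace=True)
        pilot_X.append(X[idx])
        pilot_y.append(y[idx])
    Xp, yp = np.vstack(pilot_X), np.concatenate(pilot_y)
    beta0 = weighted_logistic_mle(Xp, yp, np.ones(len(yp)))

    # Step 2: informative sampling within each site.
    # ASSUMPTION: scores |y - p| * ||x||; allocation proportional to site score.
    scores = [np.abs(y - sigmoid(X @ beta0)) * np.linalg.norm(X, axis=1)
              for X, y in sites]
    total = sum(s.sum() for s in scores)
    sub_X, sub_y, sub_w = [], [], []
    for (X, y), s in zip(sites, scores):
        r_k = max(1, int(round(r * s.sum() / total)))
        probs = s / s.sum()
        idx = rng.choice(len(y), size=r_k, replace=True, p=probs)
        sub_X.append(X[idx])
        sub_y.append(y[idx])
        # Inverse-probability weights keep the weighted likelihood unbiased
        # for the full-data likelihood summed over all sites.
        sub_w.append(1.0 / (r_k * probs[idx]))
    return weighted_logistic_mle(np.vstack(sub_X), np.concatenate(sub_y),
                                 np.concatenate(sub_w))

# Simulated data on three sites (illustrative only).
rng = np.random.default_rng(0)
beta_true = np.array([0.5, -0.5, 0.5])
sites = []
for n_k in (5000, 10000, 15000):
    X = rng.normal(size=(n_k, 3))
    y = (rng.random(n_k) < sigmoid(X @ beta_true)).astype(float)
    sites.append((X, y))

beta_hat = two_step_subsample(sites, r0=500, r=2000, rng=rng)
print(beta_hat)
```

Only per-site score totals and the sampled rows need to leave each site in this sketch, which is the practical appeal of a distributed subsampling scheme; the exact probabilities and allocation sizes derived in the paper may differ from the assumptions used here.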
Pages: 2535 - 2562
Number of pages: 28
Related papers
50 in total
  • [1] Optimal subsample selection for massive logistic regression with distributed data
    Lulu Zuo
    Haixiang Zhang
    HaiYing Wang
    Liuquan Sun
    Computational Statistics, 2021, 36 : 2535 - 2562
  • [2] Distributed information-based optimal sub-data selection algorithm for big data logistic regression
    Wan, Xiangxin
    Liu, Yanyan
    Ye, Xin
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2025,
  • [3] Optimal subsampling for multiplicative regression with massive data
    Wang, Tianzhen
    Zhang, Haixiang
    STATISTICA NEERLANDICA, 2022, 76 (04) : 418 - 449
  • [4] Optimal subsampling for modal regression in massive data
    Chao, Yue
    Huang, Lei
    Ma, Xuejun
    Sun, Jiajun
    METRIKA, 2024, 87 (04) : 379 - 409
  • [5] Adaptive distributed support vector regression of massive data
    Liang, Shu-na
    Sun, Fei
    Zhang, Qi
    COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2024, 53 (09) : 3365 - 3382
  • [6] Distributed smoothed rank regression with heterogeneous errors for massive data
    Yuan, Xiaohui
    Zhang, Xinran
    Wang, Yue
    Wang, Chunjie
    JOURNAL OF THE KOREAN STATISTICAL SOCIETY, 2023, 52 (04) : 1078 - 1103
  • [7] Subsample ignorable likelihood for regression analysis with missing data
    Little, Roderick J.
    Zhang, Nanhua
    JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES C-APPLIED STATISTICS, 2011, 60 : 591 - 605
  • [8] Optimal subsampling algorithms for composite quantile regression in massive data
    Jin, Jun
    Liu, Shuangzhe
    Ma, Tiefeng
    STATISTICS, 2023, 57 (04) : 811 - 843
  • [9] Bayesian variable selection for logistic regression
    Tian, Yiqing
    Bondell, Howard D.
    Wilson, Alyson
    STATISTICAL ANALYSIS AND DATA MINING, 2019, 12 (05) : 378 - 393
  • [10] Optimal subsampling for large-sample quantile regression with massive data
    Shao, Li
    Song, Shanshan
    Zhou, Yong
    CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE, 2023, 51 (02) : 420 - 443