Optimal subsample selection for massive logistic regression with distributed data

Cited by: 17
Authors
Zuo, Lulu [1]
Zhang, Haixiang [1]
Wang, HaiYing [2]
Sun, Liuquan [3]
Affiliations
[1] Tianjin Univ, Ctr Appl Math, Tianjin 300072, Peoples R China
[2] Univ Connecticut, Dept Stat, Mansfield, CT 06269 USA
[3] Chinese Acad Sci, Acad Math & Syst Sci, Beijing 100190, Peoples R China
Funding
National Science Foundation (USA); National Natural Science Foundation of China
Keywords
Allocation size; Big data; Distributed and massive data; Subsample estimator; Subsampling probabilities; FRAMEWORK; INFERENCE;
DOI
10.1007/s00180-021-01089-0
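Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes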
020208 ; 070103 ; 0714 ;
Abstract
With the emergence of big data, it is increasingly common for data to be distributed, i.e., stored at many sites (machines or nodes) as a result of data collection, business operations, and similar constraints. We propose a distributed subsampling procedure for this setting to efficiently approximate the maximum likelihood estimator for logistic regression. We establish the consistency and asymptotic normality of the subsample estimator given the full data. The optimal subsampling probabilities and optimal allocation sizes are obtained explicitly. We develop a two-step algorithm to approximate the optimal subsampling procedure. Numerical simulations and an application to airline data are presented to evaluate the performance of our subsampling method.
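The two-step idea in the abstract can be illustrated with a rough sketch. The abstract does not state the formulas for the optimal probabilities or allocation sizes, so everything specific below is an assumption: the score |y - p(x)| * ||x|| is a stand-in borrowed from the general optimal-subsampling literature, the allocation is taken proportional to each site's total score, and the function names `fit_weighted_logistic` and `two_step_distributed_subsample` are hypothetical. It should not be read as the authors' algorithm.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=50, tol=1e-8):
    """Newton's method for a weighted logistic log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta

def two_step_distributed_subsample(sites, r0, r, rng):
    """sites: list of (X_k, y_k) pairs, one per site (hypothetical layout)."""
    n_total = sum(len(y) for _, y in sites)

    # Step 1: uniform pilot subsample at each site, sized by site share.
    Xp, yp = [], []
    for X, y in sites:
        r0_k = max(1, round(r0 * len(y) / n_total))
        idx = rng.choice(len(y), size=r0_k, replace=True)
        Xp.append(X[idx])
        yp.append(y[idx])
    Xp, yp = np.vstack(Xp), np.concatenate(yp)
    beta0 = fit_weighted_logistic(Xp, yp, np.ones(len(yp)))

    # Step 2: assumed score |y - p| * ||x|| evaluated at the pilot estimate.
    scores = []
    for X, y in sites:
        p = 1.0 / (1.0 + np.exp(-X @ beta0))
        scores.append(np.abs(y - p) * np.linalg.norm(X, axis=1))
    totals = np.array([s.sum() for s in scores])
    # Assumed allocation: site sample size proportional to its total score.
    alloc = np.maximum(1, np.round(r * totals / totals.sum()).astype(int))

    Xs, ys, ws = [], [], []
    for (X, y), s, r_k in zip(sites, scores, alloc):
        pi = s / s.sum()                      # within-site probabilities
        idx = rng.choice(len(y), size=r_k, replace=True, p=pi)
        Xs.append(X[idx])
        ys.append(y[idx])
        ws.append(1.0 / (r_k * pi[idx]))      # inverse-probability weights
    w = np.concatenate(ws)
    w = w / w.mean()                          # rescale for numerical stability
    beta = fit_weighted_logistic(np.vstack(Xs), np.concatenate(ys), w)
    return beta, alloc

# Small synthetic demo: 4 sites of 5000 rows each, known coefficients.
rng = np.random.default_rng(0)
beta_true = np.array([0.5, -0.5, 0.5])
sites = []
for _ in range(4):
    X = np.column_stack([np.ones(5000), rng.normal(size=(5000, 2))])
    prob = 1.0 / (1.0 + np.exp(-X @ beta_true))
    sites.append((X, (rng.random(5000) < prob).astype(float)))
beta_hat, alloc = two_step_distributed_subsample(sites, r0=400, r=2000, rng=rng)
print(beta_hat, alloc)
```

In this sketch only the per-site score totals and the drawn subsamples would need to leave each site, which is the kind of communication saving a distributed subsampling scheme aims for; the inverse-probability weights correct for the non-uniform sampling.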
Pages: 2535-2562
Number of pages: 28