Optimal subsample selection for massive logistic regression with distributed data

Cited by: 17
Authors
Zuo, Lulu [1]
Zhang, Haixiang [1]
Wang, HaiYing [2]
Sun, Liuquan [3]
Affiliations
[1] Tianjin Univ, Ctr Appl Math, Tianjin 300072, Peoples R China
[2] Univ Connecticut, Dept Stat, Mansfield, CT 06269 USA
[3] Chinese Acad Sci, Acad Math & Syst Sci, Beijing 100190, Peoples R China
Funding
National Science Foundation (USA); National Natural Science Foundation of China
Keywords
Allocation size; Big data; Distributed and massive data; Subsample estimator; Subsampling probabilities; FRAMEWORK; INFERENCE;
DOI
10.1007/s00180-021-01089-0
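Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208 ; 070103 ; 0714
Abstract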
With the emergence of big data, it is increasingly common for data to be distributed, i.e., stored at many distributed sites (machines or nodes) owing to data collection, business operations, etc. We propose a distributed subsampling procedure for this setting to efficiently approximate the maximum likelihood estimator for logistic regression. We establish the consistency and asymptotic normality of the subsample estimator given the full data. The optimal subsampling probabilities and optimal allocation sizes are obtained explicitly. We develop a two-step algorithm to approximate the optimal subsampling procedure. Numerical simulations and an application to airline data are presented to evaluate the performance of our subsampling method.
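The two-step idea in the abstract can be illustrated with a rough sketch. This record does not give the paper's actual subsampling probabilities or allocation sizes, so everything specific below is an assumption made for illustration: the score |y_i - p_i| * ||x_i|| (a form common in the optimal-subsampling literature), the allocation of the second-step budget in proportion to each site's total score, and the hypothetical names `fit_weighted_logistic` and `two_step_subsample`. It is not the authors' algorithm.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=50, tol=1e-8):
    """Weighted logistic MLE via Newton-Raphson (no intercept is added)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def two_step_subsample(sites, r0, r, rng):
    """sites: list of (X_k, y_k) arrays, one pair per site (hypothetical layout)."""
    # Step 1: uniform pilot subsample of size r0 at every site.
    Xs, ys, ws = [], [], []
    for X, y in sites:
        idx = rng.choice(len(y), size=r0, replace=True)
        Xs.append(X[idx])
        ys.append(y[idx])
        ws.append(np.full(r0, len(y) / r0))  # inverse-probability weights
    beta0 = fit_weighted_logistic(np.vstack(Xs), np.concatenate(ys),
                                  np.concatenate(ws))

    # Step 2: each site scores its own rows using the pilot estimate.
    scores = []
    for X, y in sites:
        p = 1.0 / (1.0 + np.exp(-X @ beta0))
        scores.append(np.abs(y - p) * np.linalg.norm(X, axis=1))
    total = sum(s.sum() for s in scores)

    Xs, ys, ws = [], [], []
    for (X, y), s in zip(sites, scores):
        # Assumed allocation: budget proportional to the site's total score.
        r_k = max(1, int(round(r * s.sum() / total)))
        pi = s / s.sum()  # within-site sampling probabilities
        idx = rng.choice(len(y), size=r_k, replace=True, p=pi)
        Xs.append(X[idx])
        ys.append(y[idx])
        ws.append(1.0 / (r_k * pi[idx]))  # inverse-probability weights
    return fit_weighted_logistic(np.vstack(Xs), np.concatenate(ys),
                                 np.concatenate(ws))
```

In this sketch only the pilot estimate, the per-site score totals, and the selected rows would need to leave a site; the full data stay where they are stored. For simplicity the final fit uses the second-step sample alone and does not reuse the pilot sample.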
Pages: 2535 - 2562
Number of pages: 28
Related papers
50 in total
  • [31] Accounting for informatively missing data in logistic regression by means of reassessment sampling
    Lin, Ji
    Lyles, Robert H.
    STATISTICS IN MEDICINE, 2015, 34 (11) : 1925 - 1939
  • [32] Sentiment classification on Big Data using Naive Bayes and Logistic Regression
    Prabhat, Anjuman
    Khullar, Vikas
    2017 INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATION AND INFORMATICS (ICCCI), 2017
  • [33] Single-index composite quantile regression for massive data
    Jiang, Rong
    Yu, Keming
    JOURNAL OF MULTIVARIATE ANALYSIS, 2020, 180
  • [34] Optimal subsampling for quantile regression in big data
    Wang, Haiying
    Ma, Yanyuan
    BIOMETRIKA, 2021, 108 (01) : 99 - 112
  • [35] Optimal designs for multivariate logistic mixed models with longitudinal data
    Jiang, Hong-Yan
    Yue, Rong-Xian
    Zhou, Xiao-Dong
    COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2019, 48 (04) : 850 - 864
  • [36] Simultaneous selection of optimal bandwidths for the sharp regression discontinuity estimator
    Arai, Yoichi
    Ichimura, Hidehiko
    QUANTITATIVE ECONOMICS, 2018, 9 (01) : 441 - 482
  • [37] Communication-Constrained Distributed Quantile Regression with Optimal Statistical Guarantees
    Tan, Kean Ming
    Battey, Heather
    Zhou, Wen-Xin
    JOURNAL OF MACHINE LEARNING RESEARCH, 2022, 23 : 1 - 61
  • [38] Genomic-Enabled Prediction of Ordinal Data with Bayesian Logistic Ordinal Regression
    Montesinos-Lopez, Osval A.
    Montesinos-Lopez, Abelardo
    Crossa, Jose
    Burgueno, Juan
    Eskridge, Kent
    G3-GENES GENOMES GENETICS, 2015, 5 (10) : 2113 - 2126
  • [39] Variable selection with LASSO regression for complex survey data
    Iparragirre, Amaia
    Lumley, Thomas
    Barrio, Irantzu
    Arostegui, Inmaculada
    STAT, 2023, 12 (01)
  • [40] Optimal Data-Driven Regression Discontinuity Plots
    Calonico, Sebastian
    Cattaneo, Matias D.
    Titiunik, Rocio
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2015, 110 (512) : 1753 - 1769