Optimal subsample selection for massive logistic regression with distributed data

Cited by: 17
Authors
Zuo, Lulu [1]
Zhang, Haixiang [1]
Wang, HaiYing [2]
Sun, Liuquan [3]
Affiliations
[1] Tianjin Univ, Ctr Appl Math, Tianjin 300072, Peoples R China
[2] Univ Connecticut, Dept Stat, Mansfield, CT 06269 USA
[3] Chinese Acad Sci, Acad Math & Syst Sci, Beijing 100190, Peoples R China
Funding
US National Science Foundation; National Natural Science Foundation of China;
Keywords
Allocation size; Big data; Distributed and massive data; Subsample estimator; Subsampling probabilities; FRAMEWORK; INFERENCE;
DOI
10.1007/s00180-021-01089-0
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification codes
020208 ; 070103 ; 0714 ;
Abstract
With the emergence of big data, it is increasingly common for data to be distributed, i.e., stored at many distributed sites (machines or nodes) owing to data collection, business operations, etc. We propose a distributed subsampling procedure in such a setting to efficiently approximate the maximum likelihood estimator for logistic regression. We establish the consistency and asymptotic normality of the subsample estimator given the full data. The optimal subsampling probabilities and optimal allocation sizes are explicitly obtained. We develop a two-step algorithm to approximate the optimal subsampling procedure. Numerical simulations and an application to airline data are presented to evaluate the performance of our subsampling method.
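The abstract does not give the formulas for the subsampling probabilities or the allocation sizes, so the following is only a rough sketch of what a two-step distributed subsampling scheme for logistic regression can look like, not the authors' algorithm. It assumes (hypothetically) that each record is scored by |y - p| * ||x|| under a pilot estimate, that each site's allocation is proportional to its total score, and that the final fit is an inverse-probability-weighted MLE. All function names and the simulated data are illustrative.

```python
import numpy as np

def weighted_logistic_mle(X, y, w, iters=25):
    """Newton's method for a weighted logistic log-likelihood (no intercept added)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        # Tiny ridge term only guards against a singular Hessian.
        beta = beta + np.linalg.solve(hess + 1e-8 * np.eye(X.shape[1]), grad)
    return beta

def two_step_distributed_subsample(sites, r0, r, rng):
    """sites: list of (X_k, y_k) blocks, one per site. Hypothetical scheme."""
    n = sum(len(y_k) for _, y_k in sites)
    # Step 1: uniform pilot subsample, split across sites by site size.
    pilot = []
    for X_k, y_k in sites:
        m = max(1, int(round(r0 * len(y_k) / n)))
        idx = rng.choice(len(y_k), size=m, replace=True)
        pilot.append((X_k[idx], y_k[idx]))
    Xp = np.vstack([a for a, _ in pilot])
    yp = np.concatenate([b for _, b in pilot])
    beta0 = weighted_logistic_mle(Xp, yp, np.ones(len(yp)))
    # Step 2: score records with the pilot estimate (assumed criterion).
    scores = []
    for X_k, y_k in sites:
        p_k = 1.0 / (1.0 + np.exp(-(X_k @ beta0)))
        scores.append(np.abs(y_k - p_k) * np.linalg.norm(X_k, axis=1))
    total = sum(s.sum() for s in scores)  # each site only shares its score sum
    Xs, ys, ws = [], [], []
    for (X_k, y_k), s_k in zip(sites, scores):
        r_k = max(1, int(round(r * s_k.sum() / total)))  # assumed allocation rule
        idx = rng.choice(len(y_k), size=r_k, replace=True, p=s_k / s_k.sum())
        Xs.append(X_k[idx])
        ys.append(y_k[idx])
        ws.append(total / s_k[idx])  # inverse of the global sampling probability
    w = np.concatenate(ws)
    w = w / w.mean()  # rescaling weights does not change the maximizer
    return weighted_logistic_mle(np.vstack(Xs), np.concatenate(ys), w)

# Simulated example: 20,000 records split over 4 sites (illustrative data only).
rng = np.random.default_rng(0)
beta_true = np.array([0.5, -0.5, 0.5])
X = rng.normal(size=(20000, 3))
y = (rng.random(20000) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)
sites = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
beta_hat = two_step_distributed_subsample(sites, r0=500, r=2000, rng=rng)
```

In this sketch only the pilot subsample, the per-site score totals, and the second-step subsample leave each site, which is the communication pattern a distributed procedure would aim for.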
Pages: 2535-2562
Number of pages: 28
Related papers
50 in total
  • [41] Estimation of logistic regression parameters for complex survey data: simulation study based on real survey data
    Iparragirre, Amaia
    Barrio, Irantzu
    Aramendi, Jorge
    Arostegui, Inmaculada
    SORT-STATISTICS AND OPERATIONS RESEARCH TRANSACTIONS, 2024, 48 (01) : 67 - 92
  • [42] Real-Time Semiparametric Regression for Distributed Data Sets
    Luts, Jan
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2015, 27 (02) : 545 - 557
  • [43] A Distributed Storage and Access Approach for Massive Remote Sensing Data in MongoDB
    Wang, Shuang
    Li, Guoqing
    Yao, Xiaochuang
    Zeng, Yi
    Pang, Lushen
    Zhang, Lianchong
    ISPRS INTERNATIONAL JOURNAL OF GEO-INFORMATION, 2019, 8 (12)
  • [44] Additive Hazards Regression Analysis of Massive Interval-Censored Data via Data Splitting
    Huang, Peiyao
    Li, Shuwei
    Song, Xinyuan
    AMERICAN STATISTICIAN, 2024,
  • [45] Optimal and Efficient Distributed Online Learning for Big Data
    Sayin, Muhammed O.
    Vanli, N. Denizcan
    Delibalta, Ibrahim
    Kozat, Suleyman S.
    2015 IEEE INTERNATIONAL CONGRESS ON BIG DATA - BIGDATA CONGRESS 2015, 2015, : 126 - 133
  • [46] Implementation of Distributed Crawler System Based on Spark for Massive Data Mining
    Liu, Feng
    Xin, Wang
    2020 5TH INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION SYSTEMS (ICCCS 2020), 2020, : 482 - 485
  • [47] Robust distributed multicategory angle-based classification for massive data
    Sun, Gaoming
    Wang, Xiaozhou
    Yan, Yibo
    Zhang, Riquan
    METRIKA, 2024, 87 (03) : 299 - 323
  • [48] Protective estimation of mixed-effects logistic regression when data are not missing at random
    Skrondal, A.
    Rabe-Hesketh, S.
    BIOMETRIKA, 2014, 101 (01) : 175 - 188
  • [49] Process-Monitoring-for-Quality - A Model Selection Criterion for l1-Regularized Logistic Regression
    Escobar, Carlos A.
    Morales-Menendez, Ruben
    47TH SME NORTH AMERICAN MANUFACTURING RESEARCH CONFERENCE (NAMRC 47), 2019, 34 : 832 - 839
  • [50] Analyzing Big EHR Data-Optimal Cox Regression Subsampling Procedure with Rare Events
    Keret, Nir
    Gorfine, Malka
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2023, 118 (544) : 2262 - 2275