On Data-Enriched Logistic Regression

Authors
Zheng, Cheng [1]
Dasgupta, Sayan [2]
Xie, Yuxiang [3]
Haris, Asad [3]
Chen, Ying-Qing [4]
Affiliations
[1] Univ Nebraska Med Ctr, Dept Biostat, Omaha, NE 68198 USA
[2] Fred Hutchinson Canc Ctr, Vaccine & Infect Dis Div, Seattle, WA 98109 USA
[3] Univ Washington, Dept Biostat, Seattle, WA 98195 USA
[4] Stanford Univ, Dept Med, Palo Alto, CA 94305 USA
Keywords
risk prediction; logistic regression; shrinkage estimator; big data; variable selection; regularization; shrinkage; model; risk
DOI
10.3390/math13030441
Chinese Library Classification (CLC) number
O1 [Mathematics]
Discipline classification codes
0701; 070101
Abstract
Biomedical researchers typically investigate the effects of specific exposures on disease risks within a well-defined population. The gold standard for such studies is to design a trial with an appropriately sampled cohort. However, because such trials are costly, the collected sample sizes are often limited, making it difficult to accurately estimate the effects of certain exposures. In this paper, we discuss how to leverage information from external "big data" (datasets with significantly larger sample sizes) to improve estimation accuracy at the risk of introducing a small amount of bias. We propose a family of weighted estimators that balance the increase in bias against the reduction in variance when incorporating the big data. We establish a connection between our proposed estimator and the well-known penalized regression estimators. We derive optimal weights using both second-order and higher-order asymptotic expansions. Through extensive simulation studies, we demonstrate that the improvement in mean squared error (MSE) for the regression coefficient can be substantial even with finite sample sizes, and that our weighted method outperforms existing approaches such as penalized regression and the James-Stein estimator. Additionally, we provide a theoretical guarantee that, in general, the asymptotic MSE of the proposed estimators never exceeds that of the maximum likelihood estimator based on the small data alone. Finally, we apply our proposed methods to the Asia Cohort Consortium China cohort data to estimate the relationships between age, BMI, smoking, alcohol use, and mortality.
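The bias-variance trade-off described in the abstract can be illustrated with a minimal sketch. This is not the estimator or the optimal weights derived in the paper. It assumes a single scalar weight w, treats the big-data estimate as having negligible variance, and uses the simple plug-in weight w = tr(V) / (tr(V) + ||d||^2), where V is the estimated covariance of the small-data MLE and d is the difference between the two estimates. The function names (`fit_logistic`, `simulate`, `weighted_estimate`), the simulated coefficients, and the sample sizes are all hypothetical and chosen for illustration only.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson MLE for logistic regression.
    Returns the coefficient estimate and the inverse Fisher information."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        info = X.T @ (X * (p * (1.0 - p))[:, None])  # Fisher information
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    return beta, np.linalg.inv(info)

def simulate(n, beta, rng):
    """Draw n observations with an intercept and one standard-normal covariate."""
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))
    return X, y

def weighted_estimate(beta_small, cov_small, beta_big):
    """Convex combination w * beta_big + (1 - w) * beta_small.
    Assumption: the weight is a simple plug-in that minimizes
    (1 - w)^2 * tr(V) + w^2 * ||d||^2; it is not the paper's weight."""
    d = beta_big - beta_small
    v = np.trace(cov_small)
    w = v / (v + d @ d)
    return w * beta_big + (1.0 - w) * beta_small, w

rng = np.random.default_rng(0)

# Hypothetical setup: the big data follow a slightly different model,
# so borrowing from them introduces a small bias.
beta_target = np.array([-0.5, 1.0])
beta_external = np.array([-0.4, 1.1])

X_s, y_s = simulate(200, beta_target, rng)       # small, unbiased sample
X_b, y_b = simulate(50_000, beta_external, rng)  # large, slightly biased sample

beta_small, cov_small = fit_logistic(X_s, y_s)
beta_big, _ = fit_logistic(X_b, y_b)
beta_w, w = weighted_estimate(beta_small, cov_small, beta_big)

print("small-data MLE:", beta_small)
print("big-data MLE:  ", beta_big)
print("weight on big data:", w)
print("weighted estimate: ", beta_w)
```

In this sketch the weight moves toward 1 when the small-data estimate is noisy relative to the observed disagreement between the two fits, and toward 0 when the disagreement is large, which is the qualitative behavior the abstract describes.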
Pages: 21