On Data-Enriched Logistic Regression

Cited by: 0
Authors
Zheng, Cheng [1 ]
Dasgupta, Sayan [2 ]
Xie, Yuxiang [3 ]
Haris, Asad [3 ]
Chen, Ying-Qing [4 ]
Affiliations
[1] Univ Nebraska Med Ctr, Dept Biostat, Omaha, NE 68198 USA
[2] Fred Hutchinson Canc Ctr, Vaccine & Infect Dis Div, Seattle, WA 98109 USA
[3] Univ Washington, Dept Biostat, Seattle, WA 98195 USA
[4] Stanford Univ, Dept Med, Palo Alto, CA 94305 USA
Keywords
risk prediction; logistic regression; shrinkage estimator; big data; variable selection; regularization; shrinkage; model; risk
DOI
10.3390/math13030441
Chinese Library Classification (CLC) number
O1 [Mathematics]
Discipline classification codes
0701; 070101
Abstract
Biomedical researchers typically investigate the effects of specific exposures on disease risks within a well-defined population. The gold standard for such studies is to design a trial with an appropriately sampled cohort. However, because such trials are costly, the collected samples are often small, making it difficult to estimate the effects of certain exposures accurately. In this paper, we discuss how to leverage information from external "big data" (datasets with significantly larger sample sizes) to improve estimation accuracy at the risk of introducing a small amount of bias. We propose a family of weighted estimators that balance the increase in bias against the reduction in variance when the big data are incorporated. We establish a connection between our proposed estimator and the well-known penalized regression estimators. We derive optimal weights using both second-order and higher-order asymptotic expansions. Through extensive simulation studies, we demonstrate that the improvement in mean squared error (MSE) for the regression coefficients can be substantial even with finite sample sizes, and that our weighted method outperforms existing approaches such as penalized regression and the James-Stein estimator. Additionally, we provide a theoretical guarantee that, in general, the proposed estimators never yield an asymptotic MSE larger than that of the maximum likelihood estimator based on the small data alone. Finally, we apply our proposed methods to the Asia Cohort Consortium China cohort data to estimate the relationships between age, BMI, smoking, alcohol use, and mortality.
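The bias-variance trade-off described in the abstract can be illustrated with a minimal sketch. This is not the authors' estimator: the weight below is a naive plug-in choice that minimizes an approximate MSE of a convex combination of two independent maximum likelihood estimates, and every function name, sample size, and coefficient value is a hypothetical chosen only for illustration.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson MLE for logistic regression; returns (beta, covariance)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        info = X.T @ (X * (p * (1.0 - p))[:, None])   # Fisher information
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    return beta, np.linalg.inv(info)                  # inverse information ~ covariance

def weighted_estimate(beta_s, cov_s, beta_b, cov_b):
    """Convex combination of the small-data and big-data estimates.

    ASSUMPTION (not from the paper): a single scalar weight w minimizing
    (1 - w)^2 tr(cov_s) + w^2 (tr(cov_b) + ||bias||^2), with the squared
    bias replaced by the squared difference of the two estimates.
    """
    d = beta_b - beta_s
    w = np.trace(cov_s) / (np.trace(cov_s) + np.trace(cov_b) + d @ d)
    return (1.0 - w) * beta_s + w * beta_b, w

def simulate(n, beta, rng):
    """Draw n observations from a logistic model with an intercept."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, len(beta) - 1))])
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X, rng.binomial(1, p)

rng = np.random.default_rng(0)
beta_target = np.array([-0.5, 1.0, -1.0])                   # hypothetical target-population truth
beta_external = beta_target + np.array([0.0, 0.1, -0.1])    # hypothetical, slightly biased source

X_small, y_small = simulate(200, beta_target, rng)       # small, unbiased cohort
X_big, y_big = simulate(20000, beta_external, rng)       # large external dataset

beta_s, cov_s = fit_logistic(X_small, y_small)
beta_b, cov_b = fit_logistic(X_big, y_big)
beta_w, w = weighted_estimate(beta_s, cov_s, beta_b, cov_b)

print("weight on big data:", w)
print("small-data MLE:    ", beta_s)
print("big-data MLE:      ", beta_b)
print("weighted estimate: ", beta_w)
```

When the two estimates differ by much more than the small-data standard errors, the weight shrinks toward zero and the result stays close to the small-data estimate; when they agree closely, more weight moves to the lower-variance big-data estimate.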
Pages: 21