Distributed Statistical Analyses: A Scoping Review and Examples of Operational Frameworks Adapted to Health Analytics

Cited by: 0
Authors
Lemyre, Felix Camirand [1]
Levesque, Simon [1, 2]
Domingue, Marie-Pier [1, 2]
Herrmann, Klaus
Ethier, Jean-Francois [1, 3, 4]
Affiliations
[1] Univ Sherbrooke, GRIIS, 2500 Boul Univ, Sherbrooke, PQ J1K 2R1, Canada
[2] Univ Sherbrooke, Fac Sci, Dept Math, Sherbrooke, PQ, Canada
[3] Hlth Data Res Network Canada, Vancouver, BC, Canada
[4] Univ Sherbrooke, Fac Med & Sci Sante, Dept Med, Sherbrooke, PQ, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
distributed algorithms; generalized linear models; horizontally partitioned data; GLMs; learning health systems; distributed analysis; federated analysis; data science; data custodians; algorithms; statistics; synthesis; review methods; searches; scoping; COX MODEL; REGRESSION; INFERENCE; ROBUST; ESTIMATORS
DOI
10.2196/53622
Chinese Library Classification (CLC) number
R-058
Abstract
Background: Data from multiple organizations are crucial for advancing learning health systems. However, ethical, legal, and social concerns may restrict the use of standard statistical methods that rely on pooling data. Although distributed algorithms offer alternatives, they may not always be suitable for health frameworks.
Objective: This study aims to support researchers and data custodians in three ways: (1) providing a concise overview of the literature on statistical inference methods for horizontally partitioned data, (2) describing the methods applicable to generalized linear models (GLMs) and assessing their underlying distributional assumptions, and (3) adapting existing methods to make them fully usable in health settings.
Methods: A scoping review methodology was used for the literature mapping, from which methods presenting a methodological framework for GLM analyses with horizontally partitioned data were identified and assessed from the perspective of applicability in health settings. Statistical theory was used to adapt methods and derive the properties of the resulting estimators.
Results: From the review, 41 articles were selected and 6 approaches were extracted to conduct standard GLM-based statistical analysis. However, these approaches assumed evenly and identically distributed data across nodes. Consequently, statistical procedures were derived to accommodate uneven node sample sizes and heterogeneous data distributions across nodes. Workflows and detailed algorithms were developed to highlight information sharing requirements and operational complexity.
Conclusions: This study contributes to the field of health analytics by providing an overview of the methods that can be used with horizontally partitioned data, adapting these methods to the context of heterogeneous health data, and clarifying the workflows and quantities exchanged by the methods discussed. Further analysis of the confidentiality preserved by these methods is needed to fully understand the risk associated with the sharing of summary statistics.
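To make the idea of exchanging summary statistics concrete, the following is a minimal sketch of one generic way to fit a logistic regression (a GLM) on horizontally partitioned data: each node returns only its local score vector and information matrix, and a coordinator sums them to take Newton-Raphson steps. It is an illustration only and is not claimed to be one of the six approaches extracted in the review. The function names (`local_summaries`, `distributed_logistic_fit`), the simulated data, and the node sizes are assumptions made for the example, and the sketch assumes a single coefficient vector shared by all nodes, so it does not address the heterogeneous-distribution setting the study adapts methods for.

```python
import numpy as np

def local_summaries(X, y, beta):
    """Node-side step (hypothetical helper): return the local score vector
    and information matrix of the logistic log-likelihood at beta."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (y - p)
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    return score, info

def distributed_logistic_fit(nodes, n_iter=25, tol=1e-10):
    """Coordinator-side Newton-Raphson (hypothetical helper): only summed
    node summaries are used, never individual-level rows."""
    beta = np.zeros(nodes[0][0].shape[1])
    for _ in range(n_iter):
        parts = [local_summaries(X, y, beta) for X, y in nodes]
        score = sum(s for s, _ in parts)
        info = sum(i for _, i in parts)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated nodes with deliberately uneven sample sizes (assumed values).
rng = np.random.default_rng(0)
true_beta = np.array([-0.5, 1.0, 0.7])
nodes = []
for n in (50, 200, 1000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
    nodes.append((X, y))

beta_dist = distributed_logistic_fit(nodes)

# Reference fit on the pooled data, treated as a single node.
X_all = np.vstack([X for X, _ in nodes])
y_all = np.concatenate([y for _, y in nodes])
beta_pooled = distributed_logistic_fit([(X_all, y_all)])

print(np.allclose(beta_dist, beta_pooled))
```

Because the log-likelihood is a sum over individuals, the summed score and information matrix coincide with their pooled counterparts, so this particular scheme reproduces the pooled estimate regardless of how unevenly the rows are split. The cost is one round of communication per iteration, and the shared matrices are exactly the kind of summary statistics whose disclosure risk the abstract flags as needing further analysis.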
Pages: 27
    Peng, Liuhua
    [J]. ANNALS OF STATISTICS, 2021, 49 (05) : 2851 - 2869