Model aggregation for doubly divided data with large size and large dimension

Times cited: 2
Authors
He, Baihua [1]
Liu, Yanyan [1]
Yin, Guosheng [2]
Wu, Yuanshan [3]
Affiliations
[1] Wuhan Univ, Sch Math & Stat, Wuhan 430072, Hubei, Peoples R China
[2] Univ Hong Kong, Dept Stat & Actuarial Sci, Pokfulam Rd, Hong Kong, Peoples R China
[3] Zhongnan Univ Econ & Law, Sch Stat & Math, Wuhan 430073, Hubei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Communication efficiency; Computation complexity; Distributed algorithm; Greedy algorithm; High dimension; One-shot approach; Prediction; Storage ability; AVERAGING APPROACH; COMBINATION;
DOI
10.1007/s00180-022-01242-3
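Chinese Library Classification number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Discipline classification codes
020208 ; 070103 ; 0714 ;
Abstract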
Massive data are often featured with high dimensionality as well as large sample size, which typically cannot be stored in a single machine and thus make both analysis and prediction challenging. We propose a distributed gridding model aggregation (DGMA) approach to predicting the conditional mean of a response variable, which overcomes the storage limitation of a single machine and the curse of high dimensionality. Specifically, on each local machine that stores partial data of relatively moderate sample size, we develop the model aggregation approach by splitting predictors wherein a greedy algorithm is developed. To obtain the optimal weights across all local machines, we further design a distributed and communication-efficient algorithm. Our procedure effectively distributes the workload and dramatically reduces the communication cost. Extensive numerical experiments are carried out on both simulated and real datasets to demonstrate the feasibility of the DGMA method.
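The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea and not the DGMA procedure itself. It assumes each predictor block is a single predictor fitted by simple least squares, uses a generic greedy convex-aggregation rule (step size 1/k) to weight the candidate models on each machine, and combines machines by plain averaging as a stand-in for the paper's communication-efficient weight algorithm. All function names (`fit_simple`, `greedy_weights`, `local_fit`, `predict`) and the simulated data are hypothetical.

```python
import random

def fit_simple(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx

def greedy_weights(preds, y, steps=50):
    """Generic greedy convex aggregation (assumed rule, not the paper's):
    at step k, mix in the candidate that gives the smallest squared error
    when blended with the current aggregate using step size 1/k."""
    m, n = len(preds), len(y)
    w = [0.0] * m
    agg = [0.0] * n
    for k in range(1, steps + 1):
        a = 1.0 / k
        best_j, best_loss = 0, float("inf")
        for j in range(m):
            loss = sum(((1 - a) * agg[i] + a * preds[j][i] - y[i]) ** 2
                       for i in range(n))
            if loss < best_loss:
                best_j, best_loss = j, loss
        # Shrink old weights and add mass to the chosen candidate,
        # so the weights stay nonnegative and sum to one.
        w = [(1 - a) * wj for wj in w]
        w[best_j] += a
        agg = [(1 - a) * agg[i] + a * preds[best_j][i] for i in range(n)]
    return w

def local_fit(cols, y):
    """On one machine: fit one model per predictor block, then weight them."""
    models = [fit_simple(c, y) for c in cols]
    preds = [[b + s * v for v in c] for (s, b), c in zip(models, cols)]
    return models, greedy_weights(preds, y)

def predict(machines, x_new):
    """Average the locally aggregated predictions (assumed combination rule)."""
    total = 0.0
    for models, w in machines:
        total += sum(wj * (b + s * xj)
                     for wj, (s, b), xj in zip(w, models, x_new))
    return total / len(machines)

# Simulated data split by rows across two "machines"; three predictors,
# of which only the first two affect the response.
random.seed(0)
machines = []
for _ in range(2):
    cols = [[random.gauss(0, 1) for _ in range(100)] for _ in range(3)]
    y = [2 * a + b + random.gauss(0, 0.1) for a, b in zip(cols[0], cols[1])]
    machines.append(local_fit(cols, y))

print([w for _, w in machines])
print(predict(machines, [1.0, 0.0, 0.0]))
```

In this sketch only fitted coefficients and weights would need to leave each machine, which is the sense in which a one-shot scheme keeps communication low; the paper's actual weighting and communication steps may differ.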
Pages: 509-529
Number of pages: 21
Related papers
50 in total
  • [1] Model aggregation for doubly divided data with large size and large dimension
    Baihua He
    Yanyan Liu
    Guosheng Yin
    Yuanshan Wu
    Computational Statistics, 2023, 38 : 509 - 529
  • [2] Divided-area parallel process in large size RPM
    Wang, Feng
    Yan, Yongnian
    Yan, Xuri
    Zhongguo Jixie Gongcheng/China Mechanical Engineering, 2000, 11 (04): : 456 - 457
  • [3] Divided-area parallel process in large size RPM
    2000, China Mech Eng Mag Off, China (11):
  • [4] CLUSTER SIZE DISTRIBUTION IN IRREVERSIBLE AGGREGATION AT LARGE TIMES
    VANDONGEN, PGJ
    ERNST, MH
    JOURNAL OF PHYSICS A-MATHEMATICAL AND GENERAL, 1985, 18 (14): : 2779 - 2793
  • [5] Research of the mathematical model in the CCD measuring about the dimension of large-size forging workpiece
    Nie, Shao-Min
    Zhang, Qing
    Li, Shu-Kui
    Xue, Yong-Dong
    Suxing Gongcheng Xuebao/Journal of Plasticity Engineering, 2006, 13 (06): : 110 - 113
  • [6] A Dynamic Task-Model Induction Model Based Induction Mining on Large and High Dimension Data
    Sunitha, G.
    Reddy, A. Rama Mohan
    Sriharsha, A. V.
    2009 INTERNATIONAL CONFERENCE ON ADVANCES IN RECENT TECHNOLOGIES IN COMMUNICATION AND COMPUTING (ARTCOM 2009), 2009, : 741 - +
  • [7] Aggregation algorithms for very large compressed data warehouses
    Li, JZ
    Rotem, D
    Srivastava, J
    PROCEEDINGS OF THE TWENTY-FIFTH INTERNATIONAL CONFERENCE ON VERY LARGE DATA BASES, 1999, : 651 - 662
  • [8] Distributed Data Aggregation at Scale for Large Community of Users
    Liu, Belinda
    Ponnusamy, Thenna
    Ramakrishnan, Adithya
    Lang, Ziyu
    Bhaigond, Arjun
    Desai, Amit
    Yen Nguyen
    PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON BIG DATA AND EDUCATION (ICBDE 2018), 2018, : 48 - 51
  • [9] FANO VARIETIES OF LARGE DIMENSION AND LARGE INDEX
    KOLLAR, J
    VESTNIK MOSKOVSKOGO UNIVERSITETA SERIYA 1 MATEMATIKA MEKHANIKA, 1981, (03): : 31 - 34
  • [10] Creating Large Size of Data with Apache Hadoop
    Ruzicka, Jan
    Kocich, David
    Orcik, Lukas
    Svozilik, Vladislav
    RISE OF BIG SPATIAL DATA, 2017, : 307 - 314