Narrow big data in a stream: Computational limitations and regression

Cited by: 4
Authors
Cerny, Michal [1]
Affiliations
[1] Univ Econ, Dept Econometr, Fac Informat & Stat, Winston Churchill Sq 4, Prague 13067, Czech Republic
Keywords
Data stream; On-line data; Restricted memory computing; Narrow Big Data; Regression; Kolmogorov complexity; RECURSIVE ESTIMATION; ROBUSTIFICATION; MODELS; CMARS;
DOI
10.1016/j.ins.2019.02.052
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
We consider the on-line model for a data stream: data points are waiting in a queue and are accessible one-by-one by a special instruction. When a data point is processed, it is dropped forever. The data stream is assumed to be so long that it cannot be stored in memory in full: the size of memory is assumed to be polynomial in the dimension of the data, but not in the number of observations. This is a natural model for Narrow Big Data. First we prove a negative theorem illustrating that this model leads to serious limitations: we show that some elementary statistics, such as sample quantiles, cannot be computed in this model (the proof is based on a Kolmogorov complexity argument). This raises a crucial question: which data-analytic procedures can be implemented in the stream data model, and which cannot be performed at all, or only approximately (with some loss of information)? After the negative results, we turn our attention to several positive results from multivariate linear regression with Narrow Big Data. We prove that least-squares based estimators and regression diagnostic statistics, such as statistics based on the residual sum of squares, can be computed in this model efficiently. The class of statistics efficiently computable in the stream data model also includes two-stage procedures involving auxiliary regressions, such as White's heteroscedasticity test or the Breusch-Godfrey autocorrelation test (which may be surprising because the procedures, as defined, seem to require a data point to be processed several times). The computation is done exactly: we do not use preprocessing steps involving data compression techniques with information loss (such as sampling or grouping) for a reduction of the size of the data set. (C) 2019 Elsevier Inc. All rights reserved.
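The positive result for least squares can be illustrated with a minimal sketch. It is not taken from the paper: it assumes the textbook approach of accumulating the sufficient statistics X'X, X'y and y'y, whose size depends only on the dimension p and not on the number of observations. The class `StreamingOLS` and the helper `_solve_linear` are hypothetical names introduced here for illustration.

```python
from typing import List, Sequence, Tuple


class StreamingOLS:
    """Hypothetical helper: accumulates X'X, X'y and y'y one point at a time."""

    def __init__(self, p: int) -> None:
        self.p = p
        self.n = 0
        self.xtx = [[0.0] * p for _ in range(p)]  # p x p, independent of n
        self.xty = [0.0] * p
        self.yty = 0.0

    def update(self, x: Sequence[float], y: float) -> None:
        """Consume one data point; the point itself is not stored."""
        for i in range(self.p):
            self.xty[i] += x[i] * y
            for j in range(self.p):
                self.xtx[i][j] += x[i] * x[j]
        self.yty += y * y
        self.n += 1

    def solve(self) -> Tuple[List[float], float]:
        """Return (beta, RSS) from the normal equations X'X beta = X'y."""
        beta = _solve_linear(self.xtx, self.xty)
        # At the least-squares solution, RSS = y'y - beta' X'y.
        rss = self.yty - sum(b * c for b, c in zip(beta, self.xty))
        return beta, rss


def _solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting on copies of a and b."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x
```

Each call to `update` touches a data point exactly once, so the stream can be discarded as it is read. The sketch is exact only in exact arithmetic: in floating point, the normal equations can be ill-conditioned and the expression for the residual sum of squares can suffer from cancellation.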
Pages: 379-392
Number of pages: 14