Narrow big data in a stream: Computational limitations and regression

Cited by: 4
Author
Cerny, Michal [1]
Affiliation
[1] Univ Econ, Dept Econometr, Fac Informat & Stat, Winston Churchill Sq 4, Prague 13067, Czech Republic
Keywords
Data stream; On-line data; Restricted memory computing; Narrow Big Data; Regression; Kolmogorov complexity; RECURSIVE ESTIMATION; ROBUSTIFICATION; MODELS; CMARS
DOI
10.1016/j.ins.2019.02.052
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We consider the on-line model for a data stream: data points wait in a queue and are accessible one by one through a special instruction. Once a data point is processed, it is dropped forever. The data stream is assumed to be so long that it cannot be stored in memory in full: the size of memory is assumed to be polynomial in the dimension of the data, but not in the number of observations. This is a natural model for Narrow Big Data. First we prove a negative theorem illustrating that this model leads to serious limitations: we show that some elementary statistics, such as sample quantiles, cannot be computed in this model (the proof is based on a Kolmogorov complexity argument). This raises the crucial question of which data-analytic procedures can be implemented in the stream data model and which cannot be performed at all, or only approximately (with some loss of information). After the negative results, we turn our attention to several positive results from multivariate linear regression with Narrow Big Data. We prove that least-squares based estimators and regression diagnostic statistics, such as statistics based on the residual sum of squares, can be computed in this model efficiently. The class of statistics efficiently computable in the stream data model also includes two-stage procedures involving auxiliary regressions, such as White's heteroscedasticity test or the Breusch-Godfrey autocorrelation test (which may be surprising because the procedures, as defined, seem to require a data point to be processed several times). The computation is done exactly: we do not use preprocessing steps involving data compression techniques with information loss (such as sampling or grouping) to reduce the size of the data set. (C) 2019 Elsevier Inc. All rights reserved.
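The positive result for least squares can be illustrated with a minimal sketch. It is not taken from the paper: it assumes the textbook approach of accumulating the sufficient statistics X'X, X'y and y'y one observation at a time, so that memory depends only on the dimension p and not on the number of observations n. The class name `StreamingOLS` and its methods are hypothetical, and the sketch uses floating-point arithmetic, whereas the abstract speaks of exact computation.

```python
class StreamingOLS:
    """Accumulates X'X, X'y, y'y and n; memory is O(p^2), independent of n."""

    def __init__(self, p):
        self.p = p
        self.n = 0
        self.xtx = [[0.0] * p for _ in range(p)]
        self.xty = [0.0] * p
        self.yty = 0.0

    def update(self, x, y):
        """Consume one data point (x, y); it need not be stored afterwards."""
        for i in range(self.p):
            self.xty[i] += x[i] * y
            for j in range(self.p):
                self.xtx[i][j] += x[i] * x[j]
        self.yty += y * y
        self.n += 1

    def coefficients(self):
        """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
        p = self.p
        # Augmented matrix [X'X | X'y], copied so the accumulators stay intact.
        a = [row[:] + [rhs] for row, rhs in zip(self.xtx, self.xty)]
        for col in range(p):
            # Partial pivoting: pick the row with the largest entry in this column.
            pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
            if abs(a[pivot][col]) < 1e-12:
                raise ValueError("X'X is (numerically) singular")
            a[col], a[pivot] = a[pivot], a[col]
            for r in range(col + 1, p):
                f = a[r][col] / a[col][col]
                for c in range(col, p + 1):
                    a[r][c] -= f * a[col][c]
        # Back substitution.
        b = [0.0] * p
        for i in reversed(range(p)):
            s = a[i][p] - sum(a[i][j] * b[j] for j in range(i + 1, p))
            b[i] = s / a[i][i]
        return b

    def rss(self):
        """Residual sum of squares via the identity RSS = y'y - b'X'y."""
        b = self.coefficients()
        return self.yty - sum(bi * ci for bi, ci in zip(b, self.xty))


# Usage: points generated by y = 1 + 2*x, fed one by one and then discarded.
stream = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)]
model = StreamingOLS(p=2)
for x, y in stream:
    model.update(x, y)
print(model.coefficients(), model.rss())
```

The residual sum of squares is obtained without revisiting any data point because, at the least-squares solution, RSS = y'y - b'X'y. Solving the normal equations directly can be numerically fragile when X'X is ill-conditioned, so this is only an illustration of the memory pattern, not a recommended numerical method.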
Pages: 379-392
Number of pages: 14
Related papers
50 in total
  • [1] PALM: An Incremental Construction of Hyperplanes for Data Stream Regression
    Ferdaus, Md Meftahul
    Pratama, Mahardhika
    Anavatti, Sreenatha G.
    Garratt, Matthew A.
    IEEE TRANSACTIONS ON FUZZY SYSTEMS, 2019, 27 (11) : 2115 - 2129
  • [2] On Ensemble Techniques for Data Stream Regression
    Gomes, Heitor Murilo
    Montiel, Jacob
    Mastelini, Saulo Martiello
    Pfahringer, Bernhard
    Bifet, Albert
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [3] Quantile regression in big data: A divide and conquer based strategy
    Chen, Lanjue
    Zhou, Yong
    COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2020, 144
  • [4] A sliced inverse regression approach for data stream
    Marie Chavent
    Stéphane Girard
    Vanessa Kuentz-Simonet
    Benoit Liquet
    Thi Mong Ngoc Nguyen
    Jérôme Saracco
    Computational Statistics, 2014, 29 : 1129 - 1152
  • [5] A survey on data stream, big data and real-time
    Gomes E.H.A.
    Plentz P.D.M.
    De Rolt C.R.
    Dantas M.A.R.
    International Journal of Networking and Virtual Organisations, 2019, 20 (02) : 143 - 167
  • [6] A sliced inverse regression approach for data stream
    Chavent, Marie
    Girard, Stephane
    Kuentz-Simonet, Vanessa
    Liquet, Benoit
    Thi Mong Ngoc Nguyen
    Saracco, Jerome
    COMPUTATIONAL STATISTICS, 2014, 29 (05) : 1129 - 1152
  • [7] Adaptive Prediction Interval for Data Stream Regression
    Sun, Yibin
    Pfahringer, Bernhard
    Gomes, Heitor Murilo
    Bifet, Albert
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT III, PAKDD 2024, 2024, 14647 : 130 - 141
  • [8] Regression and Endogeneity Bias in Big Marketing Data
    Kaur, PankajDeep
    Arora, Sumedha
    PROCEEDINGS OF THE 4TH INTERNATIONAL CONFERENCE ON ECO-FRIENDLY COMPUTING AND COMMUNICATION SYSTEMS, 2015, 70 : 41 - 47
  • [9] Distributed quantile regression for longitudinal big data
    Fan, Ye
    Lin, Nan
    Yu, Liqun
    COMPUTATIONAL STATISTICS, 2024, 39 (02) : 751 - 779
  • [10] Time-Series Big Data Stream Evaluation
    Mursanto, Petrus
    Wibisono, Ari
    Bayu, Wendy D. W. T.
    Ahli, Valian Fil
    Rizki, May Iffah
    Hasani, Lintang Matahari
    Adibah, Jihan
    2020 5TH INTERNATIONAL WORKSHOP ON BIG DATA AND INFORMATION SECURITY (IWBIS 2020), 2020, : 43 - 47