Dealing With Data Streams: An Online, Row-by-Row, Estimation Tutorial

Cited by: 4
Authors
Ippel, Lianne [1]
Kaptein, Maurits [1]
Vermunt, Jeroen [1]
Affiliation
[1] Tilburg Univ, Methodol & Stat, Warandelaan 2,Postbus 90153, NL-5000 LL Tilburg, Netherlands
Keywords
Big Data; data streams; machine learning; online learning; Stochastic Gradient Descent; MAXIMUM-LIKELIHOOD; DESIGN; EM
DOI
10.1027/1614-2241/a000116
Chinese Library Classification (CLC) number
O1 [Mathematics]; C [Social Sciences, General]
Discipline classification codes
03; 0303; 0701; 070101
Abstract
Novel technological advances allow distributed and automatic measurement of human behavior. While these technologies provide exciting new research opportunities, they also pose challenges: datasets collected using new technologies grow increasingly large, and in many applications the collected data are continuously augmented. These data streams make the standard computation of well-known estimators inefficient, as the computation has to be repeated each time a new data point enters. In this tutorial paper, we detail online learning, an analysis method that facilitates the efficient analysis of Big Data and continuous data streams. We illustrate how common analysis methods can be adapted for use with Big Data using an online, or "row-by-row," processing approach. We present several simple (and exact) examples of online estimation and discuss Stochastic Gradient Descent as a general (approximate) approach to estimating more complex models. We end this article with a discussion of the methodological challenges that remain.
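The contrast the abstract draws between exact online updates and approximate Stochastic Gradient Descent can be illustrated with a minimal Python sketch. This is not the authors' code: the class `OnlineMeanVar`, the function `sgd_linear_step`, the learning rate, and the toy stream are all hypothetical names and values chosen for illustration. The first part uses a Welford-style update, which reproduces the batch mean and sample variance exactly; the second takes one gradient step per row for a simple linear model and is only approximate.

```python
class OnlineMeanVar:
    """Exact row-by-row mean and sample variance (Welford-style update)."""

    def __init__(self):
        self.n = 0        # number of rows seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new data point into the summaries, then discard it."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; undefined (NaN) for fewer than two rows."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def sgd_linear_step(w, b, x, y, lr=0.05):
    """One SGD step for the model y ~ w * x + b under squared-error loss.

    The learning rate `lr` is an illustrative assumption, not a recommendation.
    """
    err = (w * x + b) - y
    return w - lr * err * x, b - lr * err


# Usage: each row is processed once and never stored.
stats = OnlineMeanVar()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
print(stats.mean, stats.variance)

w, b = 0.0, 0.0
for i in range(20000):
    x = ((i % 21) - 10) / 10.0   # deterministic toy stream on [-1, 1]
    y = 2.0 * x + 1.0            # noise-free target, for illustration only
    w, b = sgd_linear_step(w, b, x, y)
print(w, b)
```

In both cases memory use stays constant as the stream grows, because only a few summary values (or the current parameter estimates) are kept between rows.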
Pages: 124-138
Number of pages: 15
Related papers
50 in total
  • [1] Techniques for dealing with incomplete data: a tutorial and survey
    Aste, Marco
    Boninsegna, Massimo
    Freno, Antonino
    Trentin, Edmondo
    PATTERN ANALYSIS AND APPLICATIONS, 2015, 18 (01) : 1 - 29
  • [2] Ensemble of Distributed Learners for Online Classification of Dynamic Data Streams
    Canzian, Luca
    Zhang, Yu
    van der Schaar, Mihaela
    IEEE TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING OVER NETWORKS, 2015, 1 (03) : 180 - 194
  • [3] The online performance estimation framework: heterogeneous ensemble learning for data streams
    Jan N. van Rijn
    Geoffrey Holmes
    Bernhard Pfahringer
    Joaquin Vanschoren
    Machine Learning, 2018, 107 : 149 - 176
  • [4] The online performance estimation framework: heterogeneous ensemble learning for data streams
    van Rijn, Jan N.
    Holmes, Geoffrey
    Pfahringer, Bernhard
    Vanschoren, Joaquin
    MACHINE LEARNING, 2018, 107 (01) : 149 - 176
  • [5] Accurate Quantile Estimation for Skewed Data Streams
    Lin, Zheng
    Liu, Jun
    Lin, Nan
    2017 IEEE 28TH ANNUAL INTERNATIONAL SYMPOSIUM ON PERSONAL, INDOOR, AND MOBILE RADIO COMMUNICATIONS (PIMRC), 2017,
  • [6] An online fuzzy model for classification of data streams with drift
    Shahparast, Homeira
    Mansoori, Eghbal G.
    2017 19TH CSI INTERNATIONAL SYMPOSIUM ON ARTIFICIAL INTELLIGENCE AND SIGNAL PROCESSING (AISP), 2017, : 91 - 95
  • [7] Online Learning From Incomplete and Imbalanced Data Streams
    You, Dianlong
    Xiao, Jiawei
    Wang, Yang
    Yan, Huigui
    Wu, Di
    Chen, Zhen
    Shen, Limin
    Wu, Xindong
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (10) : 10650 - 10665
  • [8] Online clustering of parallel data streams
    Beringer, Juergen
    Huellermeier, Eyke
    DATA & KNOWLEDGE ENGINEERING, 2006, 58 (02) : 180 - 204
  • [9] Online learning for data streams with bi-dynamic distributions
    Yan, Huigui
    Liu, Jiale
    Xiao, Jiawei
    Niu, Shina
    Dong, Siqi
    You, Dianlong
    Shen, Limin
    INFORMATION SCIENCES, 2024, 676
  • [10] Unsupervised online detection and prediction of outliers in streams of sensor data
    Reunanen, Niko
    Raty, Tomi
    Jokinen, Juho J.
    Hoyt, Tyler
    Culler, David
    INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS, 2020, 9 (03) : 285 - 314