A general framework for mining massive data streams

Cited by: 89
Authors
Domingos, P [1]
Hulten, G [1]
Affiliations
[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
Keywords
data mining; Hoeffding bounds; machine learning; scalability; subsampling
DOI
10.1198/1061860032544
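Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714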
Abstract
In many domains, data now arrive faster than we are able to mine them. To avoid wasting these data, we must switch from the traditional "one-shot" data mining approach to systems that are able to mine continuous, high-volume, open-ended data streams as they arrive. In this article we identify some desiderata for such systems, and outline our framework for realizing them. A key property of our approach is that it minimizes the time required to build a model on a stream while guaranteeing (as long as the data are iid) that the model learned is effectively indistinguishable from the one that would be obtained using infinite data. Using this framework, we have successfully adapted several learning algorithms to massive data streams, including decision tree induction, Bayesian network learning, k-means clustering, and the EM algorithm for mixtures of Gaussians. These algorithms are able to process on the order of billions of examples per day using off-the-shelf hardware. Building on this, we are currently developing software primitives for scaling arbitrary learning algorithms to massive data streams with minimal effort.
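Illustrative note: the keywords list Hoeffding bounds, and the abstract describes limiting the data used per step while staying close to the infinite-data model. The abstract does not give the article's actual procedure, so the following is only a minimal sketch of the standard one-sided Hoeffding bound for the mean of iid observations. The function names and parameters (value_range, delta, epsilon) are hypothetical and chosen for illustration.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """One-sided Hoeffding bound (standard form, not taken from the article).

    For n iid observations confined to an interval of width value_range,
    the sample mean exceeds the true mean by more than the returned
    epsilon with probability at most delta.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range, delta, epsilon):
    """Smallest n for which hoeffding_epsilon(value_range, delta, n) <= epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Hypothetical setting: a quantity in [0, 1], error 0.01, failure probability 1e-7.
    n = examples_needed(value_range=1.0, delta=1e-7, epsilon=0.01)
    print(n, hoeffding_epsilon(1.0, 1e-7, n))
```

Under these assumptions the required number of examples depends only on the range, the tolerance, and the failure probability, not on the total length of the stream, which is why a bound of this kind suits open-ended streams.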
Pages: 945-949
Page count: 5