High Dimensional Data Stream Clustering Algorithm Based on Random Projection

被引:0
作者
Zhu Y. [1 ,2 ,3 ]
Chen S. [1 ,2 ]
机构
[1] College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing
[2] MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing
[3] College of Computer Science and Engineering, Sanjiang University, Nanjing
来源
Chen, Songcan (s.chen@nuaa.edu.cn) | 1683年 / Science Press卷 / 57期
基金
中国国家自然科学基金;
关键词
Adaptive resonance theory; Clustering; Data stream clustering; High dimensional data; Random projection;
D O I
10.7544/issn1000-1239.2020.20200432
中图分类号
学科分类号
摘要
High dimensional data streams emerge ubiquitously in many real-world applications such as network monitoring. Clustering such data streams differs from traditional data clustering algorithm where the given datasets are generally static and can be read and processed repeatedly, thus facing more challenges due to having to satisfy such constraints as bounded memory, single-pass, real-time response and concept-drift detection. Recently many methods of such type have been proposed. However, when dealing with high dimensional data, they often result in high computational cost and poor performance due to the curse of dimensionality. To address the above problem, in this paper we present a new clustering algorithm for data streams, called RPFART, by combining the random projection method with the adaptive resonance theory (ART) model that has linear computational complexity, uses a single parameter, i.e., the vigilance parameter to identify data clusters, and is robust to modest parameters setting. To gain insights into the performance improvement obtained by our algorithm, we analyze and identify the major influence of random projection on ART. Although our method is embarrassingly simple just by incorporating the random projection into ART, the experimental results on variety of benchmark datasets indicate that our method can still achieve comparable or even better performance than RPGStream algorithm even if the raw dimension is compressed up to 10% of the original one. For ACT1 dataset, its dimension is reduced from 67 500 to 6 750. © 2020, Science Press. All right reserved.
引用
收藏
页码:1683 / 1696
页数:13
相关论文
共 48 条
[1]  
Nguyen H L, Woon Y K, Ng W K., A survey on data stream clustering and classification, Knowledge and Information Systems, 45, 3, pp. 535-569, (2015)
[2]  
Gaber M M, Zaslavsky A, Krishnaswamy S., Mining data streams: A review, ACM Sigmod Record, 34, 2, pp. 18-26, (2005)
[3]  
Gama J, Rodrigues P P., An overview on mining data streams, Foundations of Computational, Intelligence, 6, pp. 29-45, (2009)
[4]  
Aggarwal C C., Data streams: An overview and scientific applications, Scientific Data Mining and Knowledge Discovery, pp. 377-397, (2009)
[5]  
Silva J A, Faria E R, Barros R C, Et al., Data stream clustering: A survey, ACM Computing Surveys, 46, 1, pp. 1-31, (2013)
[6]  
Li Yangyang, Yang Guoli, He Haiyang, Et al., A study of large-scale data clustering based on fuzzy clustering, Soft Computing, 20, 8, pp. 3231-3242, (2016)
[7]  
Zhang Pu, Shen Qiang, Fuzzy c-means based coincidental link filtering in support of inferring social networks from spatiotemporal data streams, Soft Computing, 22, 21, pp. 7015-7025, (2018)
[8]  
Carnein M, Assenmacher D, Trautmann H., An empirical comparison of stream clustering algorithms, Proc of the Computing Frontiers Conf, pp. 361-366, (2017)
[9]  
Carnein M, Trautmann H., Optimizing data stream representation: An extensive survey on stream clustering algorithms, Business & Information Systems Engineering, 61, 3, pp. 277-297, (2019)
[10]  
O'callaghan L, Mishra N, Meyerson A, Et al., Streaming-data algorithms for high-quality clustering, Proc of the 18th Int Conf on Data Engineering, pp. 685-694, (2002)