On the performance of high dimensional data clustering and classification algorithms

Cited by: 27
Authors
Ericson, Kathleen [1]
Pallickara, Shrideep [1]
Affiliations
[1] Colorado State Univ, Dept Comp Sci, Ft Collins, CO 80523 USA
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2013 / Vol. 29 / No. 4
Keywords
Machine learning; Distributed stream processing; Hadoop; Mahout; Clustering; Classification; Granules
DOI
10.1016/j.future.2012.05.026
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
There is often a need to perform machine learning tasks on voluminous amounts of data. These tasks have applications in fields such as pattern recognition, data mining, bioinformatics, and recommendation systems. Here we evaluate the performance of four clustering algorithms and two classification algorithms supported by Mahout within two different cloud runtimes, Hadoop and Granules. Our benchmarks use the same Mahout backend code, ensuring a fair comparison. The differences between these implementations stem from how the Hadoop and Granules runtimes (1) support and manage the lifecycle of individual computations, and (2) orchestrate the exchange of data between different stages of the computational pipeline during successive iterations of the clustering algorithm. We include an analysis of our results for each of these algorithms in a distributed setting, as well as a discussion of measures for failure recovery. © 2012 Elsevier B.V. All rights reserved.
Pages: 1024-1034
Number of pages: 11