A Parallel DistributedWeka Framework for Big Data Mining using Spark

Cited by: 35
Authors
Koliopoulos, Aris-Kyriakos [1]
Yiapanis, Paraskevas [1]
Tekiner, Firat [1]
Nenadic, Goran [1]
Keane, John [1]
Affiliations
[1] Univ Manchester, Sch Comp Sci, Manchester, Lancs, England
Source
2015 IEEE INTERNATIONAL CONGRESS ON BIG DATA - BIGDATA CONGRESS 2015 | 2015
Keywords
Weka; Spark; Distributed Systems; Data Mining; Big Data; Machine Learning
DOI
10.1109/BigDataCongress.2015.12
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Effective Big Data Mining requires scalable and efficient solutions that are also accessible to users of all levels of expertise. Despite this, many current efforts to provide effective knowledge extraction via large-scale Big Data Mining tools focus more on performance than on use and tuning, which are complex problems even for experts. Weka is a popular and comprehensive Data Mining workbench with a well-known and intuitive interface; nonetheless, it supports only sequential single-node execution. Hence, the size of the datasets and processing tasks that Weka can handle within its existing environment is limited both by the amount of memory in a single node and by sequential execution. This work discusses DistributedWekaSpark, a distributed framework for Weka which maintains its existing user interface. The framework is implemented on top of Spark, a Hadoop-related distributed framework with fast in-memory processing capabilities and support for iterative computations. By combining Weka's usability and Spark's processing power, DistributedWekaSpark provides a usable prototype distributed Big Data Mining workbench that achieves near-linear scaling in executing various real-world scale workloads: 91.4% weak scaling efficiency on average and up to 4x faster on average than Hadoop.
Pages: 9-16
Number of pages: 8