Frequent Itemsets Mining for Big Data: A Comparative Analysis

Cited by: 31

Authors:

Apiletti, Daniele [1]

Baralis, Elena [1]

Cerquitelli, Tania [1]

Garza, Paolo [1]

Pulvirenti, Fabio [1]

Venturini, Luca [1]

Affiliations:

[1] Politecnico di Torino, Dipartimento di Automatica e Informatica, Turin, Italy

Source:

BIG DATA RESEARCH | 2017 / Volume 9

Keywords:

Big Data; Frequent itemset mining; Hadoop and Spark platforms

DOI:

10.1016/j.bdr.2017.06.006

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Itemset mining is a well-known exploratory data mining technique used to discover interesting correlations hidden in a data collection. Since it supports different targeted analyses, it is profitably exploited in a wide range of different domains, ranging from network traffic data to medical records. With the increasing amount of generated data, different scalable algorithms have been developed, exploiting the advantages of distributed computing frameworks, such as Apache Hadoop and Spark. This paper reviews Hadoop- and Spark-based scalable algorithms addressing the frequent itemset mining problem in the Big Data domain through both theoretical and experimental comparative analyses. Since the itemset mining task is computationally expensive, its distribution and parallelization strategies heavily affect memory usage, load balancing, and communication costs. A detailed discussion of the algorithmic choices of the distributed methods for frequent itemset mining is followed by an experimental analysis comparing the performance of state-of-the-art distributed implementations on both synthetic and real datasets. The strengths and weaknesses of the algorithms are thoroughly discussed with respect to the dataset features (e.g., data distribution, average transaction length, number of records), and specific parameter settings. Finally, based on theoretical and experimental analyses, open research directions for the parallelization of the itemset mining problem are presented. © 2017 Elsevier Inc. All rights reserved.

Pages: 67-83

Page count: 17

