A Parallel MapReduce Algorithm to Efficiently Support Itemset Mining on High Dimensional Data

Cited by: 15
Authors
Apiletti, Daniele [1]
Baralis, Elena [1]
Cerquitelli, Tania [1]
Garza, Paolo [1]
Pulvirenti, Fabio [1]
Michiardi, Pietro [2]
Affiliations
[1] Politecn Torino, Dipartimento Automat & Informat, Turin, Italy
[2] Eurecom, Data Sci Dept, Sophia Antipolis, France
Keywords
High-dimensional data; Frequent closed itemset mining; Hadoop framework; Big data; Association rules; Challenges; Framework
DOI
10.1016/j.bdr.2017.10.004
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In today's world, large volumes of data are continuously generated by many scientific applications, such as bioinformatics or networking. Since each monitored event is usually characterized by a variety of features, the resulting datasets are high-dimensional. To extract value from these complex collections of data, different exploratory data mining algorithms can be used to discover hidden and non-trivial correlations among data. Frequent closed itemset mining is an effective but computationally expensive technique that is usually used to support data exploration. Thanks to the spread of distributed and parallel frameworks, the development of scalable approaches able to deal with so-called Big Data has been extended to frequent itemset mining. Unfortunately, most current algorithms are designed to cope with low-dimensional datasets and deliver poor performance in use cases characterized by high-dimensional data. This work introduces PaMPa-HD, a MapReduce-based frequent closed itemset mining algorithm for high-dimensional datasets. An efficient solution is proposed to parallelize and speed up the mining process. Furthermore, different strategies are proposed to easily configure the algorithm parameter. The experimental results, obtained on real-life high-dimensional use cases, show the efficiency of the proposed approach in terms of execution time, load balancing and robustness to memory issues. © 2017 Elsevier Inc. All rights reserved.
Pages: 53-69
Number of pages: 17