Frequent Itemsets Mining for Big Data: A Comparative Analysis

Cited by: 29
Authors
Apiletti, Daniele [1]
Baralis, Elena [1]
Cerquitelli, Tania [1]
Garza, Paolo [1]
Pulvirenti, Fabio [1]
Venturini, Luca [1]
Affiliations
[1] Politecn Torino, Dipartimento Automat & Informat, Turin, Italy
Keywords
Big Data; Frequent itemset mining; Hadoop and Spark platforms
DOI
10.1016/j.bdr.2017.06.006
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Itemset mining is a well-known exploratory data mining technique used to discover interesting correlations hidden in a data collection. Since it supports different targeted analyses, it is profitably exploited in a wide range of domains, from network traffic data to medical records. With the increasing amount of generated data, different scalable algorithms have been developed, exploiting the advantages of distributed computing frameworks, such as Apache Hadoop and Spark. This paper reviews Hadoop- and Spark-based scalable algorithms addressing the frequent itemset mining problem in the Big Data domain through both theoretical and experimental comparative analyses. Since the itemset mining task is computationally expensive, its distribution and parallelization strategies heavily affect memory usage, load balancing, and communication costs. A detailed discussion of the algorithmic choices of the distributed methods for frequent itemset mining is followed by an experimental analysis comparing the performance of state-of-the-art distributed implementations on both synthetic and real datasets. The strengths and weaknesses of the algorithms are thoroughly discussed with respect to the dataset features (e.g., data distribution, average transaction length, number of records) and specific parameter settings. Finally, based on theoretical and experimental analyses, open research directions for the parallelization of the itemset mining problem are presented. © 2017 Elsevier Inc. All rights reserved.
Pages: 67-83
Number of pages: 17