Frequent Itemsets Mining for Big Data: A Comparative Analysis

Cited by: 29
Authors
Apiletti, Daniele [1]
Baralis, Elena [1]
Cerquitelli, Tania [1]
Garza, Paolo [1]
Pulvirenti, Fabio [1]
Venturini, Luca [1]
Affiliations
[1] Politecnico di Torino, Dipartimento di Automatica e Informatica, Turin, Italy
Keywords
Big Data; Frequent itemset mining; Hadoop and Spark platforms
DOI
10.1016/j.bdr.2017.06.006
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Itemset mining is a well-known exploratory data mining technique used to discover interesting correlations hidden in a data collection. Since it supports different targeted analyses, it is profitably exploited in a wide range of different domains, ranging from network traffic data to medical records. With the increasing amount of generated data, different scalable algorithms have been developed, exploiting the advantages of distributed computing frameworks, such as Apache Hadoop and Spark. This paper reviews Hadoop- and Spark-based scalable algorithms addressing the frequent itemset mining problem in the Big Data domain through both theoretical and experimental comparative analyses. Since the itemset mining task is computationally expensive, its distribution and parallelization strategies heavily affect memory usage, load balancing, and communication costs. A detailed discussion of the algorithmic choices of the distributed methods for frequent itemset mining is followed by an experimental analysis comparing the performance of state-of-the-art distributed implementations on both synthetic and real datasets. The strengths and weaknesses of the algorithms are thoroughly discussed with respect to the dataset features (e.g., data distribution, average transaction length, number of records), and specific parameter settings. Finally, based on theoretical and experimental analyses, open research directions for the parallelization of the itemset mining problem are presented. (C) 2017 Elsevier Inc. All rights reserved.
Pages: 67-83
Number of pages: 17