Optimization of frequent item set mining parallelization algorithm based on spark platform

Cited by: 0

Authors:

Deng, Fan [1]

Wang, Jiabin [1]

Lv, Sheng [1]

Affiliations:

[1] Huaqiao Univ, Sch Engn, Quanzhou 362011, Fujian, Peoples R China

Source:

DISCOVER COMPUTING | 2024 / Volume 27 / Issue 01

Keywords:

Frequent pattern mining; Spark parallelization; Transaction compression; Boolean matrices;

DOI:

10.1007/s10791-024-09470-5

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Subject classification code:

0812 ;

Abstract:

In this paper, we propose STB_Apriori, a method that combines the parallelism of the Spark platform with fast frequent itemset mining. Previous research has shown that traditional frequent itemset mining algorithms incur high overhead on large, high-dimensional datasets and generate a large number of candidate itemsets; in addition, diverse user requirements often produce very sparse and varied data. To mine massive data quickly, we draw on Spark's distributed computing capability and on common optimisation ideas for Apriori: BitSet is used for transaction compression and bit-level storage, the data are manipulated as Boolean matrices, and the processing is parallelised while the algorithmic logic is optimised. In experiments on real-world datasets, our method consistently outperforms five widely used methods by a significant margin on very large data and continues to perform well in the remaining cases, demonstrating its effectiveness on real-world tasks. Further analysis shows that increasing the number of distributed nodes continues to improve performance incrementally.
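The bit-level idea mentioned in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' STB_Apriori and involves no Spark; it only assumes a generic Apriori in which each item is stored as one column of a Boolean transaction matrix, packed into a Python integer that stands in for a BitSet. All function names here (`build_bit_columns`, `support`, `frequent_itemsets`) are hypothetical and chosen for illustration.

```python
from itertools import combinations


def build_bit_columns(transactions):
    """Pack the Boolean transaction matrix column by column.

    Bit t of columns[item] is 1 when transaction t contains the item.
    A Python int stands in for a BitSet here (an assumption of this sketch).
    """
    columns = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            columns[item] = columns.get(item, 0) | (1 << tid)
    return columns


def support(itemset, columns):
    """Support of a non-empty itemset: popcount of the AND of its columns."""
    items = iter(itemset)
    bits = columns[next(items)]
    for item in items:
        bits &= columns[item]
    return bin(bits).count("1")


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori using bitwise AND for support counting."""
    columns = build_bit_columns(transactions)
    current = [frozenset([item]) for item in columns
               if support([item], columns) >= min_support]
    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset, columns)
        known = set(current)
        candidates = set()
        # Join step: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        for a, b in combinations(current, 2):
            union = a | b
            # Prune step: every k-subset must itself be frequent.
            if len(union) == k + 1 and all(
                frozenset(sub) in known for sub in combinations(union, k)
            ):
                candidates.add(union)
        current = [c for c in candidates if support(c, columns) >= min_support]
        k += 1
    return result
```

Calling `frequent_itemsets(transactions, min_support)` with a list of item lists and an absolute support threshold returns a dictionary from each frequent itemset (as a `frozenset`) to its support count. In a distributed setting the support counting would presumably be spread across partitions, but how the paper does this is not stated in the abstract.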


Pages: 19
