A novel framework for generic Spark workload characterization and similar pattern recognition using machine learning

Cited by: 2

Authors:

Garralda-Barrio, Mariano [1]

Eiras-Franco, Carlos [1]

Bolon-Canedo, Veronica [1]

Affiliations:

[1] Univ A Coruna, CITIC, La Coruna, Spain

Source:

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING | 2024 / Vol. 189

Keywords:

Big data; Workload characterization; Apache Spark; Pattern recognition; Machine learning; Feature selection

DOI:

10.1016/j.jpdc.2024.104881

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Comprehensive workload characterization plays a pivotal role in comprehending Spark applications, as it enables the analysis of diverse aspects and behaviors. This understanding is indispensable for devising downstream tuning objectives, such as performance improvement. To address this pivotal issue, our work introduces a novel and scalable framework for generic Spark workload characterization, complemented by consistent geometric measurements. The presented approach aims to build robust workload descriptors by profiling only quantitative metrics at the application task-level, in a non-intrusive manner. We expand our framework for downstream workload pattern recognition by incorporating unsupervised machine learning techniques: clustering algorithms and feature selection. These techniques significantly improve the process of grouping similar workloads without relying on predefined labels. We effectively recognize 24 representative Spark workloads from diverse domains, including SQL, machine learning, web search, graph, and micro-benchmarks, available in HiBench. Our framework achieves a high accuracy F-Measure score of up to 90.9% and a Normalized Mutual Information of up to 94.5% in similar workload pattern recognition. These scores significantly outperform the results obtained in a comparative analysis with an established workload characterization approach in the literature.

References

Pages: 16

41 entries in total

[1] A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench [J].

Ahmed, N. ;

Barczak, Andre L. C. ;

Susnjak, Teo ;

Rashid, Mohammed A. .

JOURNAL OF BIG DATA, 2020, 7 (01)

[2] A comparison of extrinsic clustering evaluation metrics based on formal constraints [J].

Amigo, Enrique ;

Gonzalo, Julio ;

Artiles, Javier ;

Verdejo, Felisa .

INFORMATION RETRIEVAL, 2009, 12 (04) :461-486

[3] [Anonymous], 2018, Apache Flink

[4] [Anonymous], 2018, Monitoring and instrumentation spark

[5] [Anonymous], 2021, Docker images for apache spark executed on hadoop yarn

[6] [Anonymous], 2018, Apache Spark-Unified Analytics Engine for Big Data

[7] Arthur D, 2007, PROCEEDINGS OF THE EIGHTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P1027

[8] Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads [J].

Awan, Ahsan Javed ;

Brorsson, Mats ;

Vlassov, Vladimir ;

Ayguade, Eduard .

PROCEEDINGS OF 2016 IEEE INTERNATIONAL CONFERENCES ON BIG DATA AND CLOUD COMPUTING (BDCLOUD 2016) SOCIAL COMPUTING AND NETWORKING (SOCIALCOM 2016) SUSTAINABLE COMPUTING AND COMMUNICATIONS (SUSTAINCOM 2016) (BDCLOUD-SOCIALCOM-SUSTAINCOM 2016), 2016, :59-66

[9] Feature selection for high-dimensional data [J].

Bolón-Canedo V. ;

Sánchez-Maroño N. ;

Alonso-Betanzos A. .

Progress in Artificial Intelligence, 2016, 5 (2) :65-75

[10] Model-based evaluation of clustering validation measures [J].

Brun, Marcel ;

Sima, Chao ;

Hua, Jianping ;

Lowey, James ;

Carroll, Brent ;

Suh, Edward ;

Dougherty, Edward R. .

PATTERN RECOGNITION, 2007, 40 (03) :807-824
