Adaptive incremental transfer learning for efficient performance modeling of big data workloads

Cited by: 0

Authors:

Garralda-Barrio, Mariano ^{[1]}

Eiras-Franco, Carlos ^{[1]}

Bolon-Canedo, Veronica ^{[1]}

Affiliations:

[1] Univ A Coruna, CITIC, La Coruna, Spain

Source:

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2025 / Vol. 166

Keywords:

Performance modeling; Big data; Machine learning; Apache Spark; Distributed computing

DOI:

10.1016/j.future.2025.107730

Chinese Library Classification (CLC) number:

TP301 [Theory, methods]

Discipline classification code:

081202

Abstract:

The rise of data-intensive scalable computing systems, such as Apache Spark, has transformed data processing by enabling the efficient manipulation of large datasets across machine clusters. However, system configuration to optimize performance remains a challenge. This paper introduces an adaptive incremental transfer learning approach to predicting workload execution times. By integrating both unsupervised and supervised learning, we develop models that adapt incrementally to new workloads and configurations. To guide the optimal selection of relevant workloads, the model employs the coefficient of distance variation (CdV) and the coefficient of quality correlation (CqC), combined in the exploration-exploitation balance coefficient (EEBC). Comprehensive evaluations demonstrate the robustness and reliability of our model for performance modeling in Spark applications, with average improvements of up to 31% over state-of-the-art methods. This research contributes to efficient performance tuning systems by enabling transfer learning from historical workloads to new, previously unseen workloads. The full source code is openly available.


Pages: 17
