Adaptive incremental transfer learning for efficient performance modeling of big data workloads

Cited by: 0
Authors
Garralda-Barrio, Mariano [1]
Eiras-Franco, Carlos [1]
Bolon-Canedo, Veronica [1]
Affiliations
[1] Univ A Coruna, CITIC, La Coruna, Spain
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2025 / Vol. 166
Keywords
Performance modeling; Big data; Machine learning; Apache Spark; Distributed computing
DOI
10.1016/j.future.2025.107730
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The rise of data-intensive scalable computing systems, such as Apache Spark, has transformed data processing by enabling the efficient manipulation of large datasets across machine clusters. However, system configuration to optimize performance remains a challenge. This paper introduces an adaptive incremental transfer learning approach to predicting workload execution times. By integrating both unsupervised and supervised learning, we develop models that adapt incrementally to new workloads and configurations. To guide the optimal selection of relevant workloads, the model employs the coefficient of distance variation (CdV) and the coefficient of quality correlation (CqC), combined in the exploration-exploitation balance coefficient (EEBC). Comprehensive evaluations demonstrate the robustness and reliability of our model for performance modeling in Spark applications, with average improvements of up to 31% over state-of-the-art methods. This research contributes to efficient performance tuning systems by enabling transfer learning from historical workloads to new, previously unseen workloads. The full source code is openly available.
Pages: 17