Adaptive incremental transfer learning for efficient performance modeling of big data workloads

Cited by: 0
Authors
Garralda-Barrio, Mariano [1]
Eiras-Franco, Carlos [1]
Bolon-Canedo, Veronica [1]
Affiliations
[1] Univ A Coruna, CITIC, La Coruna, Spain
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2025 / Vol. 166
Keywords
Performance modeling; Big data; Machine learning; Apache Spark; Distributed computing
DOI
10.1016/j.future.2025.107730
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
The rise of data-intensive scalable computing systems, such as Apache Spark, has transformed data processing by enabling the efficient manipulation of large datasets across machine clusters. However, system configuration to optimize performance remains a challenge. This paper introduces an adaptive incremental transfer learning approach to predicting workload execution times. By integrating both unsupervised and supervised learning, we develop models that adapt incrementally to new workloads and configurations. To guide the optimal selection of relevant workloads, the model employs the coefficient of distance variation (CdV) and the coefficient of quality correlation (CqC), combined in the exploration-exploitation balance coefficient (EEBC). Comprehensive evaluations demonstrate the robustness and reliability of our model for performance modeling in Spark applications, with average improvements of up to 31% over state-of-the-art methods. This research contributes to efficient performance tuning systems by enabling transfer learning from historical workloads to new, previously unseen workloads. The full source code is openly available.
Pages: 17