Characterizing Distributed Machine Learning Workloads on Apache Spark (Experimentation and Deployment Paper)

Cited by: 1
Authors
Djebrouni, Yasmine [1 ]
Rocha, Isabelly [2 ]
Bouchenak, Sara [3 ]
Chen, Lydia [2 ,4 ]
Felber, Pascal [2 ]
Marangozova, Vania [1 ]
Schiavoni, Valerio [2 ]
Affiliations
[1] Univ Grenoble Alps, Grenoble, France
[2] Univ Neuchatel, Neuchatel, Switzerland
[3] INSA Lyon, Lyon, France
[4] Delft Univ Technol, Delft, Netherlands
Source
PROCEEDINGS OF THE 24TH ACM/IFIP INTERNATIONAL MIDDLEWARE CONFERENCE, MIDDLEWARE 2023 | 2023
Keywords
Distributed Machine Learning; Distributed Deep Learning; Trace Collection; Workload Characterization; Multi-level Configuration; Performance
DOI
10.1145/3590140.3629112
CLC number
TP39 [Computer Applications]
Subject classification
081203; 0835
Abstract
Distributed machine learning (DML) environments are widely used in many application domains to build decision-making systems. However, the complexity of these environments is overwhelming for novice users. On the one hand, data scientists are more familiar with hyper-parameter tuning and typically lack an understanding of the trade-offs and challenges of parameterizing DML platforms to achieve good performance. On the other hand, system administrators focus on tuning distributed platforms, unaware of the possible implications of the platform on the quality of the learning models. To shed light on such parameter configuration interplay, we run multiple DML workloads on the widely used Apache Spark distributed platform, leveraging 13 popular learning methods and 6 real-world datasets on two distinct clusters. We collect and perform an in-depth analysis of workload execution traces to compare the efficiency of different configuration strategies. We consider tuning only hyper-parameters, tuning only platform parameters, and jointly tuning both hyper-parameters and platform parameters. We publicly release our collected traces and derive key takeaways on DML workloads. Counter-intuitively, platform parameters have a higher impact on the model quality than hyper-parameters. More generally, we show that multi-level parameter configuration can provide better results in terms of model quality and execution time while also optimizing resource costs.
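The abstract's central idea of multi-level configuration, i.e. jointly tuning platform parameters and model hyper-parameters rather than either level alone, can be sketched as a joint grid search over both parameter spaces. The parameter names below (Spark executor settings, random-forest hyper-parameters) and their value ranges are illustrative assumptions, not values taken from the paper:

```python
from itertools import product

# Illustrative search spaces (assumed values, not from the paper).
platform_params = {
    "spark.executor.memory": ["4g", "8g"],
    "spark.executor.cores": [2, 4],
}
hyper_params = {
    "numTrees": [50, 100],
    "maxDepth": [5, 10],
}

def joint_grid(platform, hyper):
    """Enumerate every combined (platform, hyper-parameter) configuration."""
    keys = list(platform) + list(hyper)
    values = list(platform.values()) + list(hyper.values())
    for combo in product(*values):
        yield dict(zip(keys, combo))

configs = list(joint_grid(platform_params, hyper_params))
print(len(configs))  # 2 * 2 * 2 * 2 = 16 combined configurations
```

Each resulting dictionary mixes both levels, so a tuner evaluating these configurations would observe the platform/hyper-parameter interplay the paper studies, at the cost of a multiplicatively larger search space.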
Pages: 151 - 164
Page count: 14
Related papers
7 entries
  • [1] Model averaging in distributed machine learning: a case study with Apache Spark
    Guo, Yunyan
    Zhang, Zhipeng
    Jiang, Jiawei
    Wu, Wentao
    Zhang, Ce
    Cui, Bin
    Li, Jianzhong
    VLDB JOURNAL, 2021, 30 (04): 693 - 712
  • [2] Predicting Diabetes using Distributed Machine Learning based on Apache Spark
    Ahmed, Hager
    Younis, Eman M. G.
    Ali, Abdelmgeid A.
    PROCEEDINGS OF 2020 INTERNATIONAL CONFERENCE ON INNOVATIVE TRENDS IN COMMUNICATION AND COMPUTER ENGINEERING (ITCE), 2020, : 44 - 49
  • [3] Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads
    Jangda, Abhinav
    Huang, Jun
    Liu, Guodong
    Sabet, Amir Hossein Nodehi
    Maleki, Saeed
    Miao, Youshan
    Musuvathi, Madanlal
    Mytkowicz, Todd
    Saarikivi, Olli
    ASPLOS '22: PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS, 2022, : 402 - 416
  • [4] dSyncPS: Delayed Synchronization for Dynamic Deployment of Distributed Machine Learning
    Guo, Yibo
    Wang, An
    PROCEEDINGS OF THE 2022 2ND EUROPEAN WORKSHOP ON MACHINE LEARNING AND SYSTEMS (EUROMLSYS '22), 2022, : 79 - 86
  • [5] Distributed Deep Learning for Big Remote Sensing Data Processing on Apache Spark: Geological Remote Sensing Interpretation as a Case Study
    Long, Ao
    Han, Wei
    Huang, Xiaohui
    Li, Jiabao
    Wang, Yuewei
    Chen, Jia
    WEB AND BIG DATA, PT I, APWEB-WAIM 2023, 2024, 14331 : 96 - 110
  • [6] Screening hardware and volume factors in distributed machine learning algorithms on Spark: A design of experiments (DoE) based approach
    Rodrigues, Jairson B.
    Vasconcelos, Germano C.
    Maciel, Paulo R. M.
    COMPUTING, 2021, 103 (10) : 2203 - 2225
  • [7] Comparative Analysis on the Deployment of Machine Learning Algorithms in the Distributed Brillouin Optical Time Domain Analysis (BOTDA) Fiber Sensor
    Nordin, Nur Dalilla
    Zan, Mohd Saiful Dzulkefly
    Abdullah, Fairuz
    PHOTONICS, 2020, 7 (04)