On combining system and machine learning performance tuning for distributed data stream applications

Cited by: 0
Authors
Lambros Odysseos
Herodotos Herodotou
Affiliation
[1] Cyprus University of Technology, Department of Electrical Engineering and Computer Engineering and Informatics
Source
Distributed and Parallel Databases | 2023 / Vol. 41
Keywords
Stream processing; Machine learning; System parameter tuning; Hyper-parameter tuning;
DOI
Not available
Abstract
The growing need to identify patterns in data and automate decisions based on them in near-real time has stimulated the development of new machine learning (ML) applications that process continuous data streams. However, deploying ML applications over distributed stream processing engines (DSPEs) such as Apache Spark Streaming is a complex procedure that requires extensive tuning along two dimensions. First, DSPEs have a plethora of system configuration parameters, such as the degree of parallelism and memory buffer sizes, that directly affect application throughput and/or latency and need to be optimized. Second, ML models have their own set of hyperparameters that require tuning, as they can significantly affect the overall prediction accuracy of the trained model. These two forms of tuning have been studied extensively in the literature, but only in isolation from each other. This manuscript presents a comprehensive experimental study that combines system configuration and hyperparameter tuning of ML applications over DSPEs. The experimental results reveal unexpected and complex interactions between the choices of system configurations and hyperparameters, and their impact on both application and model performance. These insights motivate the need for new combined system and ML model tuning approaches, and open up new research directions in the field of self-managing distributed stream processing systems.
Pages: 411-438
Page count: 27