A Hybrid Machine Learning Approach for Performance Modeling of Cloud-Based Big Data Applications

Cited by: 4

Authors:

Ataie, Ehsan [1,2]

Evangelinou, Athanasia [3]

Gianniti, Eugenio [3]

Ardagna, Danilo [3]

Affiliations:

[1] Univ Mazandaran, Dept Comp Engn, Babolsar, Iran

[2] Univ Mazandaran, Distributed Comp Syst Res Grp, Babolsar, Iran

[3] Politecn Milan, Dipartimento Elettron Informaz & Bioingn, Milan, Italy

Source:

COMPUTER JOURNAL | 2022 / Vol. 65 / Issue 12

Keywords:

analytical performance modeling; machine learning; cloud computing; MapReduce; Hadoop; Tez; Spark;

DOI:

10.1093/comjnl/bxab131

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Apache Hadoop and Apache Spark are two of the most prominent distributed solutions on the market for processing big data applications. Since these frameworks are frequently adopted to support business-critical activities, it is often important to predict with fair confidence the execution time of submitted applications, for instance when service-level agreements are established with end users. In this work, we propose and validate a hybrid approach for the performance prediction of big data applications running on clouds. The approach exploits both analytical modeling and machine learning (ML) techniques, and achieves good accuracy without requiring many time-consuming and costly experiments on a real setup. The experimental results show that the proposed approach improves on ML techniques applied without any support from analytical models in terms of accuracy, number of experiments to be run on the operational system, and cost. Moreover, we compare our approach with Ernest, an ML-based technique proposed in the literature by the Spark inventors. Experiments show that Ernest can accurately estimate performance in interpolating scenarios, while it fails to predict performance when configurations with an increasing number of cores are considered. Finally, a comparison with a similar hybrid approach proposed in the literature demonstrates that our approach significantly reduces prediction errors, especially when few experiments on the real system are performed.
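The general idea of combining an analytical model with a data-driven correction can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not detail. It assumes a hypothetical analytical model (a fixed serial part plus perfectly parallel work) and fits a simple least-squares correction on a handful of made-up measurements. The names `analytical_time`, `fit_correction`, `hybrid_predict`, and all numeric values are illustrative assumptions.

```python
from statistics import mean


def analytical_time(cores, serial_s=20.0, parallel_s=1200.0):
    """Hypothetical analytical model: serial time plus perfectly parallel work."""
    return serial_s + parallel_s / cores


def fit_correction(measurements):
    """Least-squares fit of: measured_time ~ a * analytical_time + b."""
    xs = [analytical_time(cores) for cores, _ in measurements]
    ys = [seconds for _, seconds in measurements]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


def hybrid_predict(cores, a, b):
    """Analytical estimate adjusted by the learned linear correction."""
    return a * analytical_time(cores) + b


# Made-up (cores, seconds) measurements standing in for a few real runs.
measurements = [(8, 190.0), (16, 107.5), (32, 66.25)]
a, b = fit_correction(measurements)

# Predict for a core count outside the measured range.
prediction_64 = hybrid_predict(64, a, b)
print(f"a={a:.3f}, b={b:.3f}, predicted time on 64 cores: {prediction_64:.2f} s")
```

In this sketch the analytical model supplies the shape of the scaling curve, so only two parameters need to be learned from measurements. That is one plausible reason a hybrid scheme could need fewer real experiments than a purely data-driven model.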


Pages: 3123-3140

Page count: 18

