Characterization of a Big Data Storage Workload in the Cloud

Cited by: 8
Authors
Talluri, Sacheendra [1, 5]
Luszczak, Alicja [2 ]
Abad, Cristina L. [3 ]
Iosup, Alexandru [4 ]
Affiliations
[1] Delft Univ Technol, Delft, Netherlands
[2] Databricks BV, Amsterdam, Netherlands
[3] Escuela Super Politecn Litoral, Guayaquil, Ecuador
[4] Vrije Univ, Amsterdam, Netherlands
[5] Databricks, Amsterdam, Netherlands
Source
PROCEEDINGS OF THE 2019 ACM/SPEC INTERNATIONAL CONFERENCE ON PERFORMANCE ENGINEERING (ICPE '19) | 2019
DOI
10.1145/3297663.3310302
Chinese Library Classification (CLC)
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The proliferation of big data processing platforms has led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems facilitates tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. To address this problem, in this work we collect and analyze a 6-month Spark workload from a major provider of big data processing services, Databricks. Our analysis focuses on a number of key features, such as the long-term trends of reads and modifications, the statistical properties of reads, and the popularity of clusters and of file formats. Overall, we present numerous findings that could form the basis of new systems studies and designs. Our quantitative evidence and its analysis suggest the existence of daily and weekly load imbalances, of heavy-tailed and bursty behaviour, of the relative rarity of modifications, and of proliferation of big data specific formats.
Pages: 33-44
Page count: 12