Characterization of a Big Data Storage Workload in the Cloud

Cited by: 6
Authors
Talluri, Sacheendra [1, 5]
Luszczak, Alicja [2]
Abad, Cristina L. [3]
Iosup, Alexandru [4]
Affiliations
[1] Delft Univ Technol, Delft, Netherlands
[2] Databricks BV, Amsterdam, Netherlands
[3] Escuela Super Politecn Litoral, Guayaquil, Ecuador
[4] Vrije Univ, Amsterdam, Netherlands
[5] Databricks, Amsterdam, Netherlands
Source
PROCEEDINGS OF THE 2019 ACM/SPEC INTERNATIONAL CONFERENCE ON PERFORMANCE ENGINEERING (ICPE '19) | 2019
DOI
10.1145/3297663.3310302
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The proliferation of big data processing platforms has led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems facilitates tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. To address this problem, in this work we collect and analyze a 6-month Spark workload from a major provider of big data processing services, Databricks. Our analysis focuses on a number of key features, such as the long-term trends of reads and modifications, the statistical properties of reads, and the popularity of clusters and of file formats. Overall, we present numerous findings that could form the basis of new systems studies and designs. Our quantitative evidence and its analysis suggest the existence of daily and weekly load imbalances, of heavy-tailed and bursty behaviour, of the relative rarity of modifications, and of proliferation of big data specific formats.
Pages: 33-44
Page count: 12
Related papers
50 in total
  • [21] Secure Big Data Storage and Sharing Scheme for Cloud Tenants
    Cheng Hongbing
    Rong Chunming
    Hwang Kai
    Wang Weihong
    Li Yanyan
    CHINA COMMUNICATIONS, 2015, 12 (06) : 106 - 115
  • [22] Cost-Effective, Workload-Adaptive Migration of Big Data Applications to the Cloud
    Giannakouris, Victor
    Fernandez, Alejandro
    Simitsis, Alkis
    Babu, Shivnath
    SIGMOD '19: PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2019, : 1909 - 1912
  • [24] A survey on data storage and placement methodologies for Cloud-Big Data ecosystem
    Mazumdar, Somnath
    Seybold, Daniel
    Kritikos, Kyriakos
    Verginadis, Yiannis
    JOURNAL OF BIG DATA, 2019, 6 (01)
  • [25] Automated Workload Characterization in Cloud-based Transactional Data Grids
    Ciciani, Bruno
    Didona, Diego
    Di Sanzo, Pierangelo
    Palmieri, Roberto
    Peluso, Sebastiano
    Quaglia, Francesco
    Romano, Paolo
    2012 IEEE 26TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS & PHD FORUM (IPDPSW), 2012, : 1525 - 1533
  • [26] Workload-aware storage policies for cloud object storage
    Chen, Yu
    Tong, Wei
    Feng, Dan
    Wang, Zike
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2022, 163 : 232 - 247
  • [27] Workload Management for Big Data Analytics
    Aboulnaga, Ashraf
    Babu, Shivnath
    2013 IEEE 29TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2013, : 1249 - 1249
  • [28] Cloud storage reliability for Big Data applications: A state of the art survey
    Nachiappan, Rekha
    Javadi, Bahman
    Calheiros, Rodrigo N.
    Matawie, Kenan M.
    JOURNAL OF NETWORK AND COMPUTER APPLICATIONS, 2017, 97 : 35 - 47
  • [29] SecSVA: Secure Storage, Verification, and Auditing of Big Data in the Cloud Environment
    Aujla, Gagangeet Singh
    Chaudhary, Rajat
    Kumar, Neeraj
    Das, Ashok Kumar
    Rodrigues, Joel J. P. C.
    IEEE COMMUNICATIONS MAGAZINE, 2018, 56 (01) : 78 - 85
  • [30] Blockchain-based public auditing for big data in cloud storage
    Li, Jiaxing
    Wu, Jigang
    Jiang, Guiyuan
    Srikanthan, Thambipillai
    INFORMATION PROCESSING & MANAGEMENT, 2020, 57 (06)