Intermediate Results Materialization Selection and Format for Data-Intensive Flows

Cited by: 3
Authors
Faisal Munir, Rana [1]
Nadal, Sergi [1]
Romero, Oscar [1]
Abello, Alberto [1]
Jovanovic, Petar [1]
Thiele, Maik [2]
Lehner, Wolfgang [2]
Affiliations
[1] UPC, Barcelona, Spain
[2] TUD, Dresden, Germany
Keywords
Big Data; Data-Intensive Flows; Intermediate Results; Data Format; HDFS; MAPREDUCE; OPTIMIZATION; QUERIES; VIEWS
DOI
10.3233/FI-2018-1734
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Subject classification code
081202; 0835
Abstract
Data-intensive flows deploy a variety of complex data transformations to build information pipelines from data sources to different end users. As data are processed, these workflows generate large intermediate results, typically pipelined from one operator to the following ones. Materializing intermediate results that are shared among multiple flows brings benefits not only in performance but also in resource usage and consistency. Similar ideas have been proposed in the context of data warehouses, where they are studied as the materialized view selection problem. With the rise of Big Data systems, new challenges emerge because new quality metrics, captured by service level agreements, must be taken into account. Moreover, the way such results are stored must be reconsidered, as different data layouts can be used to reduce the I/O cost. In this paper, we propose a novel approach for the automatic, multi-objective selection of intermediate results to materialize in data-intensive flows, which can tackle multiple and conflicting quality objectives. In addition, our approach chooses the optimal storage data format for the selected materialized intermediate results based on subsequent access patterns. The experimental results show that our approach provides 40% better average speedup with respect to the current state of the art, as well as an 18% improvement in disk access time compared to fixed-format solutions.
Pages: 111 - 138
Page count: 28
Related papers
50 in total
  • [21] Automated Debugging in Data-Intensive Scalable Computing
    Gulzar, Muhammad Ali
    Interlandi, Matteo
    Han, Xueyuan
    Li, Mingda
    Condie, Tyson
    Kim, Miryung
    PROCEEDINGS OF THE 2017 SYMPOSIUM ON CLOUD COMPUTING (SOCC '17), 2017, : 520 - 534
  • [22] Optimizing Interactive Development of Data-Intensive Applications
    Interlandi, Matteo
    Tetali, Sai Deep
    Gulzar, Muhammad Ali
    Noor, Joseph
    Condie, Tyson
    Kim, Miryung
    Millstein, Todd
    PROCEEDINGS OF THE SEVENTH ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC 2016), 2016, : 510 - 522
  • [23] HyDB: Access Optimization for Data-Intensive Service
    Zhu, Qing
    Qin, Zuoyan
    2012 IEEE 14TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS & 2012 IEEE 9TH INTERNATIONAL CONFERENCE ON EMBEDDED SOFTWARE AND SYSTEMS (HPCC-ICESS), 2012, : 580 - 587
  • [24] Experiences with workflows for automating data-intensive bioinformatics
    Ola Spjuth
    Erik Bongcam-Rudloff
    Guillermo Carrasco Hernández
    Lukas Forer
    Mario Giovacchini
    Roman Valls Guimera
    Aleksi Kallio
    Eija Korpelainen
    Maciej M Kańduła
    Milko Krachunov
    David P Kreil
    Ognyan Kulev
    Paweł P. Łabaj
    Samuel Lampa
    Luca Pireddu
    Sebastian Schönherr
    Alexey Siretskiy
    Dimitar Vassilev
    Biology Direct, 10
  • [25] Data-intensive research in physics: challenges and perspectives
    Meera, B. M.
    Hiremath, Vani
    ANNALS OF LIBRARY AND INFORMATION STUDIES, 2018, 65 (01) : 43 - 49
  • [26] Privacy-Aware Data-Intensive Applications
    Guerriero, Michele
    PROCEEDINGS OF THE 2017 32ND IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE'17), 2017, : 1030 - 1033
  • [27] Survey of Scientific Programming Techniques for the Management of Data-Intensive Engineering Environments
    Maria Alvarez-Rodriguez, Jose
    Alor-Hernandez, Giner
    Mejia-Miranda, Jezreel
    SCIENTIFIC PROGRAMMING, 2018, 2018
  • [28] Data-Intensive Scalable Computing for Scientific Applications
    Bryant, Randal E.
    COMPUTING IN SCIENCE & ENGINEERING, 2011, 13 (06) : 25 - 33
  • [29] Optimal Resource Provisioning for Data-intensive Microservices
    Erdei, Roland Mark
    Toka, Laszlo
    PROCEEDINGS OF THE IEEE/IFIP NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM 2022, 2022,
  • [30] Status, challenges and trends of data-intensive supercomputing
    Wei, Jia
    Chen, Mo
    Wang, Longxiang
    Ren, Pei
    Lei, Yujia
    Qu, Yuqi
    Jiang, Qiyu
    Dong, Xiaoshe
    Wu, Weiguo
    Wang, Qiang
    Zhang, Kaili
    Zhang, Xingjun
    CCF TRANSACTIONS ON HIGH PERFORMANCE COMPUTING, 2022, 4 (02) : 211 - 230