Cost-based Data Prefetching and Scheduling in Big Data Platforms over Tiered Storage Systems

Cited by: 3

Authors:

Herodotou, Herodotos ^{[1]}

Kakoulli, Elena ^{[1,2]}

Affiliations:

[1] Cyprus Univ Technol, 30 Arch Kyprianos Str, CY-3036 Limassol, Cyprus

[2] Neapolis Univ Pafos, 2 Danais Ave, CY-8042 Pafos, Cyprus

Source:

ACM TRANSACTIONS ON DATABASE SYSTEMS | 2023 / Vol. 48 / Issue 04

Keywords:

Distributed file systems; tiered storage; data prefetching; task scheduling; DATA LOCALITY; MAPREDUCE; OPTIMIZATION; PERFORMANCE; EFFICIENT;

DOI:

10.1145/3625389

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

The use of storage tiering is becoming popular in data-intensive compute clusters due to the recent advancements in storage technologies. The Hadoop Distributed File System, for example, now supports storing data in memory, SSDs, and HDDs, while OctopusFS and hatS offer fine-grained storage tiering solutions. However, current big data platforms (such as Hadoop and Spark) are not exploiting the presence of storage tiers and the opportunities they present for performance optimizations. Specifically, schedulers and prefetchers make decisions based only on data locality information and completely ignore the fact that local data are now stored on a variety of storage media with different performance characteristics. This article presents Trident, a scheduling and prefetching framework that is designed to make task assignment, resource scheduling, and prefetching decisions based on both locality and storage tier information. Trident formulates task scheduling as a minimum cost maximum matching problem in a bipartite graph and utilizes two novel pruning algorithms for bounding the size of the graph, while still guaranteeing optimality. In addition, Trident extends YARN's resource request model and proposes a new storage-tier-aware resource scheduling algorithm. Finally, Trident includes a cost-based data prefetching approach that coordinates with the schedulers for optimizing prefetching operations. Trident is implemented in both Spark and Hadoop and evaluated extensively using a realistic workload derived from Facebook traces as well as an industry-validated benchmark, demonstrating significant benefits in terms of application performance and cluster efficiency.
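To make the matching formulation in the abstract concrete, the following is a minimal sketch of assigning tasks to slots by solving a minimum cost maximum matching in a bipartite graph. It is not Trident's implementation: the tier costs, the remote-read cost, the helper names (`assignment_cost`, `min_cost_max_matching`), and the example replica layout are all hypothetical, and the pruning algorithms mentioned in the abstract are not modeled. The solver is a generic successive-shortest-augmenting-path method, chosen only because it fits in a few lines of standard-library Python.

```python
from collections import deque

# Hypothetical per-tier read costs (illustrative only, not values from the paper).
TIER_COST = {"MEMORY": 1, "SSD": 3, "HDD": 6}
REMOTE_COST = 10  # assumed cost when the node holds no local replica


def assignment_cost(task_replicas, node):
    """Cost of running a task on `node`, given its {node: tier} replica map."""
    tier = task_replicas.get(node)
    return TIER_COST[tier] if tier is not None else REMOTE_COST


def min_cost_max_matching(cost):
    """Min-cost maximum bipartite matching via successive shortest paths.

    cost[i][j] is the cost of assigning task i to slot j (None = not allowed).
    Returns (matching as {task: slot}, total cost).
    """
    n_tasks, n_slots = len(cost), len(cost[0])
    source, sink = n_tasks + n_slots, n_tasks + n_slots + 1
    graph = [[] for _ in range(n_tasks + n_slots + 2)]  # node -> edge indices
    edges = []  # [to, capacity, cost]; edge k ^ 1 is the reverse of edge k

    def add_edge(u, v, c):
        graph[u].append(len(edges))
        edges.append([v, 1, c])
        graph[v].append(len(edges))
        edges.append([u, 0, -c])

    for i in range(n_tasks):
        add_edge(source, i, 0)
        for j in range(n_slots):
            if cost[i][j] is not None:
                add_edge(i, n_tasks + j, cost[i][j])
    for j in range(n_slots):
        add_edge(n_tasks + j, sink, 0)

    total = 0
    while True:
        # Queue-based Bellman-Ford: cheapest augmenting path in the residual graph.
        dist = [float("inf")] * len(graph)
        prev_edge = [-1] * len(graph)
        in_queue = [False] * len(graph)
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            in_queue[u] = False
            for k in graph[u]:
                v, cap, c = edges[k]
                if cap > 0 and dist[u] + c < dist[v]:
                    dist[v] = dist[u] + c
                    prev_edge[v] = k
                    if not in_queue[v]:
                        in_queue[v] = True
                        queue.append(v)
        if prev_edge[sink] == -1:
            break  # no augmenting path left: the matching is maximum
        v = sink
        while v != source:  # push one unit of flow back along the path
            k = prev_edge[v]
            edges[k][1] -= 1
            edges[k ^ 1][1] += 1
            v = edges[k ^ 1][0]
        total += dist[sink]

    matching = {}
    for i in range(n_tasks):
        for k in graph[i]:
            v, cap, _ = edges[k]
            # Even-indexed edges are forward edges; capacity 0 means "used".
            if k % 2 == 0 and cap == 0 and n_tasks <= v < n_tasks + n_slots:
                matching[i] = v - n_tasks
    return matching, total


# Hypothetical cluster: one free slot on each of three nodes.
nodes = ["n1", "n2", "n3"]
replicas = [
    {"n1": "MEMORY", "n2": "HDD"},  # task 0
    {"n1": "SSD", "n3": "HDD"},     # task 1
    {"n1": "MEMORY"},               # task 2
]
cost = [[assignment_cost(r, n) for n in nodes] for r in replicas]
matching, total = min_cost_max_matching(cost)
print(matching, total)
```

In this toy layout all three tasks have a local replica on `n1`, so locality alone cannot rank them. With tier costs in the objective, the matching gives `n1` to task 2, whose only replica is there, and sends tasks 0 and 1 to their HDD replicas on `n2` and `n3`, avoiding any remote read.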

Pages: 40

50 related records in total

[1] Trident: Task Scheduling over Tiered Storage Systems in Big Data Platforms
Herodotou, Herodotos
Kakoulli, Elena
PROCEEDINGS OF THE VLDB ENDOWMENT, 2021, 14 (09) : 1570 - 1582
[2] Data Prefetching and Eviction Mechanisms of In-Memory Storage Systems Based on Scheduling for Big Data Processing
Chen, Chien-Hung
Hsia, Ting-Yuan
Huang, Yennun
Kuo, Sy-Yen
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2019, 30 (08) : 1738 - 1752
[3] Task Scheduling in Big Data Platforms: A Systematic Literature Review
Soualhia, Mbarka
Khomh, Foutse
Tahar, Sofiene
JOURNAL OF SYSTEMS AND SOFTWARE, 2017, 134 : 170 - 189
[4] Adaptive cache policy scheduling for big data applications on distributed tiered storage system
Gu, Rong
Li, Chongjie
Shu, Peng
Yuan, Chunfeng
Huang, Yihua
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2019, 31 (15)
[5] DynDL: Scheduling Data-Locality-Aware Tasks with Dynamic Data Transfer Cost for Multicore-Server-Based Big Data Clusters
Jin, Jiahui
An, Qi
Zhou, Wei
Tang, Jiakai
Xiong, Runqun
APPLIED SCIENCES-BASEL, 2018, 8 (11)
[6] Cost-Aware Scheduling and Data Skew Alleviation for Big Data Processing in Heterogeneous Cloud Environment
Li, Hongjian
Zhu, Lisha
Wang, Shuaicheng
Wang, Lei
JOURNAL OF GRID COMPUTING, 2023, 21 (03)
[7] Research On Tiered Storage Method For Big Data Of Virtual Information Based On Cloud Computing
Chen, Ping
Liu, Jianlan
Liu, Xing
Zheng, Ruiying
Pan, Yongyan
2019 INTERNATIONAL CONFERENCE ON SMART GRID AND ELECTRICAL AUTOMATION (ICSGEA), 2019, : 308 - 311
[8] ExaPlan: Efficient Queueing-Based Data Placement, Provisioning, and Load Balancing for Large Tiered Storage Systems
Iliadis, Ilias
Jelitto, Jens
Kim, Yusik
Sarafijanovic, Slavisa
Venkatesan, Vinodh
ACM TRANSACTIONS ON STORAGE, 2017, 13 (02)
[9] Streaming Machine Learning for Supporting Data Prefetching in Modern Data Storage Systems
Lucas Filho, Edson Ramiro
Yang, Lun
Fu, Kebo
Herodotou, Herodotos
PROCEEDINGS OF THE 1ST WORKSHOP ON AI FOR SYSTEMS, AI4SYS 2023, 2023, : 7 - 12
[10] A Dynamic Resource Allocation Method for Load-Balance Scheduling over Big Data Platforms
Tang, Wenda
Liu, Xiang
Rafique, Wajid
Dou, Wanchun
IEEE 2018 INTERNATIONAL CONGRESS ON CYBERMATICS / 2018 IEEE CONFERENCES ON INTERNET OF THINGS, GREEN COMPUTING AND COMMUNICATIONS, CYBER, PHYSICAL AND SOCIAL COMPUTING, SMART DATA, BLOCKCHAIN, COMPUTER AND INFORMATION TECHNOLOGY, 2018, : 524 - 531