Optimizing Near-Data Processing for Spark

Cited by: 1
Authors
Rachuri, Sri Pramodh [1]
Gantasala, Arun [1]
Emanuel, Prajeeth [1]
Gandhi, Anshul [1]
Foley, Robert [2]
Puhov, Peter [2]
Gkountouvas, Theodoros [3]
Lei, Hui [3]
Affiliations
[1] SUNY Stony Brook, Stony Brook, NY 11794 USA
[2] FutureWei, Santa Clara, CA USA
[3] OpenInfra Labs, London, England
Source
2022 IEEE 42ND INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2022) | 2022
Funding
National Science Foundation (USA);
Keywords
resource disaggregation; near-data processing; spark; pushdown; modeling;
DOI
10.1109/ICDCS54860.2022.00067
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Resource disaggregation (RD) is an emerging paradigm for data center computing whereby resource-optimized servers are employed to minimize resource fragmentation and improve resource utilization. Apache Spark deployed under the RD paradigm employs a cluster of compute-optimized servers to run executors and a cluster of storage-optimized servers to host the data on HDFS. However, the network transfer from storage to compute cluster becomes a severe bottleneck for big data processing. Near-data processing (NDP) is a concept that aims to alleviate network load in such cases by offloading (or "pushing down") some of the compute tasks to the storage cluster. Employing NDP for Spark under the RD paradigm is challenging because storage-optimized servers have limited computational resources and cannot host the entire Spark processing stack. Further, even if such a lightweight stack could be developed and deployed on the storage cluster, it is not entirely obvious which Spark queries would benefit from pushdown, and which tasks of a given query should be pushed down to storage. This paper presents the design and implementation of a near-data processing system for Spark, SparkNDP, that aims to address the aforementioned challenges. SparkNDP works by implementing novel NDP Spark capabilities on the storage cluster using a lightweight library of SQL operators and then developing an analytical model to help determine which Spark tasks should be pushed down to storage based on the current network and system state. Simulation and prototype implementation results show that SparkNDP can help reduce Spark query execution times when compared to both the default approach of not pushing down any tasks to storage and the outright NDP approach of pushing all tasks to storage.
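The abstract does not give the analytical model itself, so the following is only an illustrative sketch of the kind of trade-off it describes: pushing a task down shrinks the data sent over the network but runs the work on slower storage-side CPUs. Every name and parameter here (`Task`, `ClusterState`, `selectivity`, `cpu_seconds_per_gb`, `storage_slowdown`) is a hypothetical assumption for illustration, not SparkNDP's actual model or API.

```python
from dataclasses import dataclass

GB = 1e9  # bytes per gigabyte (decimal)


@dataclass
class Task:
    """Hypothetical description of one Spark task that reads from storage."""
    name: str
    input_bytes: float         # bytes the task reads from the storage cluster
    selectivity: float         # output bytes / input bytes (e.g. 0.1 for a filter)
    cpu_seconds_per_gb: float  # assumed CPU cost per GB on a compute-cluster core


@dataclass
class ClusterState:
    """Hypothetical snapshot of the current network and system state."""
    network_bytes_per_s: float  # available storage-to-compute bandwidth
    storage_slowdown: float     # storage CPU is this many times slower (>= 1)


def time_without_pushdown(task: Task, state: ClusterState) -> float:
    # Ship the full input over the network, then process it on the compute cluster.
    transfer = task.input_bytes / state.network_bytes_per_s
    compute = task.input_bytes / GB * task.cpu_seconds_per_gb
    return transfer + compute


def time_with_pushdown(task: Task, state: ClusterState) -> float:
    # Process on the (slower) storage cluster, then ship only the reduced output.
    compute = task.input_bytes / GB * task.cpu_seconds_per_gb * state.storage_slowdown
    transfer = task.input_bytes * task.selectivity / state.network_bytes_per_s
    return compute + transfer


def should_push_down(task: Task, state: ClusterState) -> bool:
    # Push down only when the estimated time is strictly lower.
    return time_with_pushdown(task, state) < time_without_pushdown(task, state)


if __name__ == "__main__":
    scan_filter = Task("filter", input_bytes=10 * GB, selectivity=0.1,
                       cpu_seconds_per_gb=1.0)
    fast_net = ClusterState(network_bytes_per_s=1 * GB, storage_slowdown=4.0)
    slow_net = ClusterState(network_bytes_per_s=0.1 * GB, storage_slowdown=4.0)
    print("fast network:", should_push_down(scan_filter, fast_net))
    print("congested network:", should_push_down(scan_filter, slow_net))
```

In this toy setting the same filter task is kept on the compute cluster when bandwidth is plentiful and pushed down when the network is congested, which mirrors the abstract's point that neither "never push down" nor "always push down" is best in every state.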
Pages: 636-646
Page count: 11
Related papers
50 in total
  • [41] Jarvis: Large-scale Server Monitoring with Adaptive Near-data Processing
    Sandur, Atul
    Park, ChanHo
    Volos, Stavros
    Agha, Gul
    Jeon, Myeongjae
    2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022), 2022, : 1408 - 1422
  • [42] Near-Data Source Graph Partitioning
    Chang, Furong
    Guo, Hao
    Ullah, Farhan
    Wang, Haochen
    Zhao, Yue
    Zhang, Haitian
    ELECTRONICS, 2024, 13 (22)
  • [43] Two Reconfigurable NDP Servers: Understanding the Impact of Near-Data Processing on Data Center Applications
    Song, Xiaojia
    Xie, Tao
    Fischer, Stephen
    ACM TRANSACTIONS ON STORAGE, 2021, 17 (04)
  • [44] An Efficient Scheduling Algorithm for Multi-mode Tasks on Near-Data Processing SSDs
    Li, Guo
    Chen, Xianzhang
    Liu, Duo
    Li, Jiali
    Tan, Yujuan
    Ren, Ao
    ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2023, PT VII, 2024, 14493 : 1 - 16
  • [45] HyQA: Hybrid Near-Data Processing Platform for Embedding based Question Answering System
    Liang, Shengwen
    Yuan, Ziming
    Wang, Ying
    Xu, Dawen
    Li, Huawei
    Li, Xiaowei
    2024 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION, DATE, 2024,
  • [46] An Architecture for Integrated Near-Data Processors
    Vermij, Erik
    Fiorin, Leandro
    Jongerius, Rik
    Hagleitner, Christoph
    Van Lunteren, Jan
    Bertels, Koen
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2017, 14 (03)
  • [47] Identifying the potential of Near Data Processing for Apache Spark
    Awan, Ahsan Javed
    Ohara, Moriyoshi
    Ayguade, Eduard
    Ishizaki, Kazuaki
    Brorsson, Mats
    Vlassov, Vladimir
    MEMSYS 2017: PROCEEDINGS OF THE INTERNATIONAL SYMPOSIUM ON MEMORY SYSTEMS, 2017, : 60 - 67
  • [48] Co-ML: A Case for Collaborative ML Acceleration using Near-data Processing
    Aga, Shaizeen
    Jayasena, Nuwan
    Ignatowski, Mike
    MEMSYS 2019: PROCEEDINGS OF THE INTERNATIONAL SYMPOSIUM ON MEMORY SYSTEMS, 2019, : 506 - 517
  • [49] NDRec: A Near-Data Processing System for Training Large-Scale Recommendation Models
    Li, Shiyu
    Wang, Yitu
    Hanson, Edward
    Chang, Andrew
    Ki, Yang Seok
    Li, Hai
    Chen, Yiran
    IEEE TRANSACTIONS ON COMPUTERS, 2024, 73 (05) : 1248 - 1261
  • [50] An energy-efficient near-data processing accelerator for DNNs to optimize memory accesses
    Khabbazan, Bahareh
    Sabri, Mohammad
    Riera, Marc
    Gonzalez, Antonio
    JOURNAL OF SYSTEMS ARCHITECTURE, 2025, 159