Apache Nemo: A Framework for Optimizing Distributed Data Processing

Cited by: 3
Authors
Song, Won Wook [1]
Yang, Youngseok [1]
Eo, Jeongyoon [1]
Seo, Jangho [2]
Kim, Joo Yeon [3]
Lee, Sanha [2]
Lee, Gyewon [1]
Um, Taegeon [1]
Cho, Haeyoon [1]
Chun, Byung-Gon [1]
Affiliations
[1] Seoul Natl Univ, Comp Sci & Engn Dept, 1 Gwanak Ro, Seoul 08826, South Korea
[2] Samsung Elect, 56 Seongchon Gil, Seoul 06765, South Korea
[3] Naver Corp, 6 Buljeong Ro, Seongnam Si 13561, Gyeonggi Do, South Korea
Source
ACM TRANSACTIONS ON COMPUTER SYSTEMS | 2021 / Vol. 38 / Issue 3-4
Funding
National Research Foundation of Singapore;
Keywords
Data processing; distributed systems;
DOI
10.1145/3468144
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation of dataflow, compiler optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario. Apache Nemo is open-sourced at https://nemo.apache.org as an Apache incubator project.
Pages: 31