An Architecture for Compiling UDF-centric Workflows

Cited by: 3
Authors
Crotty, Andrew [1]
Galakatos, Alex [1]
Dursun, Kayhan [1]
Kraska, Tim [1]
Binnig, Carsten [1]
Cetintemel, Ugur [1]
Zdonik, Stan [1]
Affiliation
[1] Brown Univ, Dept Comp Sci, Providence, RI 02912 USA
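Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2015 / Vol. 8 / No. 12
Funding
National Science Foundation (USA);
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract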
Data analytics has recently grown to include increasingly sophisticated techniques, such as machine learning and advanced statistics. Users frequently express these complex analytics tasks as workflows of user-defined functions (UDFs) that specify each algorithmic step. However, given typical hardware configurations and dataset sizes, the core challenge of complex analytics is no longer sheer data volume but rather the computation itself, and the next generation of analytics frameworks must focus on optimizing for this computation bottleneck. While query compilation has gained widespread popularity as a way to tackle the computation bottleneck for traditional SQL workloads, relatively little work addresses UDF-centric workflows in the domain of complex analytics. In this paper, we describe a novel architecture for automatically compiling workflows of UDFs. We also propose several optimizations that consider properties of the data, UDFs, and hardware together in order to generate different code on a case-by-case basis. To evaluate our approach, we implemented these techniques in TUPLEWARE, a new high-performance distributed analytics system, and our benchmarks show performance improvements of up to three orders of magnitude compared to alternative systems.
Pages: 1466-1477
Page count: 12
Related papers
38 in total
[1]   ASTERIX: An Open Source System for "Big Data" Management and Analysis (Demo) [J].
Alsubaiee, Sattam ;
Altowim, Yasser ;
Altwaijry, Hotham ;
Behm, Alexander ;
Borkar, Vinayak ;
Bu, Yingyi ;
Carey, Michael ;
Grover, Raman ;
Heilbron, Zachary ;
Kim, Young-Seok ;
Li, Chen ;
Onose, Nicola ;
Pirzadeh, Pouria ;
Vernica, Rares ;
Wen, Jian .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2012, 5 (12) :1898-1901
[2]  
ASTRAHAN MM, 1979, COMPUTER, V12, P42, DOI 10.1109/MC.1979.1658743
[3]  
Borkar Vinayak, 2012, IEEE DATA ENG B, V35, P24
[4]  
Bu YY, 2010, PROC VLDB ENDOW, V3, P285
[5]  
Cafarella M. J., 2010, WORKSH WEB DAT, P10
[6]  
Canny J, 2013, 19TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'13), P95
[7]   SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets [J].
Chaiken, Ronnie ;
Jenkins, Bob ;
Larson, Per-Ake ;
Ramsey, Bill ;
Shakib, Darren ;
Weaver, Simon ;
Zhou, Jingren .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2008, 1 (02) :1265-1276
[8]  
Chattopadhyay B, 2011, PROC VLDB ENDOW, V4, P1318
[9]   Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads [J].
Chen, Yanpei ;
Alspaugh, Sara ;
Katz, Randy .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2012, 5 (12) :1802-1813
[10]  
Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI '04), P137