Towards unified secure on- and off-line analytics at scale

Cited by: 3
Authors
Coetzee, P. [1]
Leeke, M. [1]
Jarvis, S. [1]
Affiliations
[1] Univ Warwick, Dept Comp Sci, Coventry CV4 7AL, W Midlands, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
Data science; Analytics; Streaming analysis; Hadoop; Domain specific languages; Data intensive computing;
DOI
10.1016/j.parco.2014.07.004
Chinese Library Classification number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
Data scientists have applied various analytic models and techniques to address the oft-cited problems of large volume, high velocity data rates and diversity in semantics. Such approaches have traditionally employed analytic techniques in a streaming or batch processing paradigm. This paper presents CRUCIBLE, a first-in-class framework for the analysis of large-scale datasets that exploits both streaming and batch paradigms in a unified manner. The CRUCIBLE framework includes a domain specific language for describing analyses as a set of communicating sequential processes, a common runtime model for analytic execution in multiple streamed and batch environments, and an approach to automating the management of cell-level security labelling that is applied uniformly across runtimes. This paper shows the applicability of CRUCIBLE to a variety of state-of-the-art analytic environments, and compares a range of runtime models for their scalability and performance against a series of native implementations. The work demonstrates the significant impact of runtime model selection, including improvements of between 2.3x and 480x between runtime models, with an average performance gap of just 14x between CRUCIBLE and a suite of equivalent native implementations. (C) 2014 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/3.0/).
Pages: 738-753
Page count: 16
Related papers
24 in total
[1]  
Aihkisalo T., 2011, Proceedings of the 2011 IEEE World Congress on Services (SERVICES 2011), P122, DOI 10.1109/SERVICES.2011.61
[2]  
Ali Mohamed, 2010, P 1 INT C EXH COMP G
[3]  
Bell D.E., 1976, ESDTR75306 MITRE COR
[4] Bigtable: A distributed storage system for structured data [J].
Chang, Fay ;
Dean, Jeffrey ;
Ghemawat, Sanjay ;
Hsieh, Wilson C. ;
Wallach, Deborah A. ;
Burrows, Mike ;
Chandra, Tushar ;
Fikes, Andrew ;
Gruber, Robert E. .
ACM TRANSACTIONS ON COMPUTER SYSTEMS, 2008, 26 (02)
[5]  
Coetzee P., 2013, P 2013 INT WORKSH DA, P43
[6]  
De Francisci Morales G., 2013, P 22 INT C WORLD WID
[7]  
Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI '04), P137
[8]  
Efftinge S., 2006, P ECL MOD S ECL SUMM
[9]  
Fuchs Adam., 2012, Accumulo-extensions to google's bigtable design
[10]  
Golab L, 2009, P 35 SIGMOD C MAN DA