Evaluating Fault Tolerance of Distributed Stream Processing Systems

Cited by: 0
Authors
Wang, Xiaotong [1]
Jiang, Cheng [1]
Fang, Junhua [2]
Shu, Ke [3]
Zhang, Rong [1]
Qian, Weining [1]
Zhou, Aoying [1]
Affiliations
[1] East China Normal Univ, Sch Data Sci & Engn, Shanghai, Peoples R China
[2] Soochow Univ, Suzhou, Peoples R China
[3] PingCAP Ltd, Shanghai, Peoples R China
Source
WEB AND BIG DATA, PT II, APWEB-WAIM 2020 | 2020 / Vol. 12318
Funding
National Science Foundation (USA);
Keywords
Fault tolerance; Benchmarking; Stream processing;
DOI
10.1007/978-3-030-60290-1_8
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Since failures in large-scale clusters can lead to severe performance degradation and break system availability, fault tolerance is critical for distributed stream processing systems (DSPSs). Plenty of fault tolerance approaches have been proposed over the last decade. However, no systematic work evaluates and compares them in detail. Previous work either evaluates global performance during failure-free runtime or merely measures throughput loss when a failure happens. This paper is the first to propose an evaluation framework customized for quantitatively comparing the runtime overhead and recovery efficiency of fault tolerance mechanisms in DSPSs. We define three typical configurable workloads, which are widely adopted in previous DSPS evaluations. We construct five workload suites based on these three workloads to investigate the effects of different factors on fault tolerance performance. We carry out extensive experiments on two well-known open-source DSPSs. The results demonstrate the performance gap between the two systems, which is useful for the choice and evolution of fault tolerance approaches.
Pages: 101-116
Page count: 16
Related papers
22 in total
[1]   Aurora: a new model and architecture for data stream management [J].
Abadi, DJ ;
Carney, D ;
Cetintemel, U ;
Cherniack, M ;
Convey, C ;
Lee, S ;
Stonebraker, M ;
Tatbul, N ;
Zdonik, S .
VLDB JOURNAL, 2003, 12 (02) :120-139
[2]  
Arasu A., 2004, Proceedings of the Thirtieth International Conference on Very Large Data Bases, V30, P480
[3]  
Balazinska Magdalena, 2005, P 2005 ACM SIGMOD IN, P13
[4]  
Bordin M V., 2017, Ph.D. thesis
[5]  
Carbone P., 2015, Bull. Tech. Comm. Data Eng., V38, P28, DOI 10.1109/IC2EW.2016.56
[6]   Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming [J].
Chintapalli, Sanket ;
Dagit, Derek ;
Evans, Bobby ;
Farivar, Reza ;
Graves, Thomas ;
Holderbaugh, Mark ;
Liu, Zhuo ;
Nusbaum, Kyle ;
Patil, Kishorkumar ;
Peng, Boyang Jerry ;
Poulosky, Paul .
2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2016, :1789-1792
[7]  
Fernandez Raul Castro, 2013, SIGMOD, P725
[8]   Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications [J].
Gill, Phillipa ;
Jain, Navendu ;
Nagappan, Nachiappan .
ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2011, 41 (04) :350-361
[9]  
Grier Jamie., 2016, Extending the yahoo! streaming benchmark
[10]  
Heinze T., 2015, DEBS, P150, DOI 10.1145/2675743.2771831