Scrub: Online TroubleShooting for Large Mission-Critical Applications

Cited by: 4
Authors
Satish, Arjun [1]
Shiou, Thomas [1]
Zhang, Chuck [1]
Elmeleegy, Khaled [1,2]
Zwaenepoel, Willy [3]
Affiliations
[1] Turn Inc, Redwood City, CA 94063 USA
[2] Oracle Cloud, Redwood Shores, CA USA
[3] Ecole Polytech Fed Lausanne, Lausanne, Switzerland
Source
EUROSYS '18: PROCEEDINGS OF THE THIRTEENTH EUROSYS CONFERENCE | 2018
Keywords
Scrub; Advertising; Mission Critical; Big Data; Query Processing; Troubleshooting; Debugging; Distributed Systems
DOI
10.1145/3190508.3190513
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Scrub is a troubleshooting tool for distributed applications that operate under the strict SLOs common in production environments. It allows users to formulate queries on events occurring during execution in order to assess the correctness of the application's operation. Scrub has been in use for two years at Turn, where developers and users have relied on it to resolve numerous issues in its online advertisement bidding platform. This platform spans thousands of machines across the globe, serves several million bid requests per second, and dispenses many millions of dollars in advertising budgets. Troubleshooting distributed applications is notoriously hard, and the difficulty is exacerbated by strict SLOs, which require the troubleshooting tool to have only minimal impact on the hosts running the application. Furthermore, with large amounts of money at stake, users expect to be able to run frequent diagnostics and demand quick evaluation and remediation of any problems. These constraints have led to a number of design and implementation decisions that run counter to conventional wisdom. In particular, Scrub supports only a restricted form of joins. Its query execution strategy eschews imposing overhead on the application hosts: joins, group-by operations, and aggregations are sent to a dedicated centralized facility. In terms of implementation, Scrub avoids the overhead and security concerns of dynamic instrumentation. Finally, at all levels of the system, accuracy is traded for minimal impact on the hosts. We present the design and implementation of Scrub and contrast its choices with those made in earlier systems. We illustrate its power by describing a number of use cases, and we demonstrate its negligible overhead on the underlying application. On average, we observe a CPU overhead of up to 2.5% on application hosts and a 1% increase in request latency. These overheads allow the advertisement bidding platform to operate well within its SLOs.
Pages: 15