Fay: Extensible Distributed Tracing from Kernels to Clusters

Cited by: 34
Authors
Erlingsson, Ulfar
Peinado, Marcus [1]
Peter, Simon
Budiu, Mihai
Mainar-Ruiz, Gloria [1]
Affiliations
[1] Microsoft Res, eXtreme Comp Grp, Redmond, WA 98052 USA
Source
ACM TRANSACTIONS ON COMPUTER SYSTEMS | 2012 / Vol. 30 / Issue 04
Keywords
Design; Experimentation; Languages; Measurement; Performance;
DOI
10.1145/2382553.2382555
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
Fay is a flexible platform for the efficient collection, processing, and analysis of software execution traces. Fay provides dynamic tracing through use of runtime instrumentation and distributed aggregation within machines and across clusters. At the lowest level, Fay can be safely extended with new tracing primitives, including even untrusted, fully optimized machine code, and Fay can be applied to running user-mode or kernel-mode software without compromising system stability. At the highest level, Fay provides a unified, declarative means of specifying what events to trace, as well as the aggregation, processing, and analysis of those events. We have implemented the Fay tracing platform for Windows and integrated it with two powerful, expressive systems for distributed programming. Our implementation is easy to use, can be applied to unmodified production systems, and provides primitives that allow the overhead of tracing to be greatly reduced, compared to previous dynamic tracing platforms. To show the generality of Fay tracing, we reimplement, in experiments, a range of tracing strategies and several custom mechanisms from existing tracing frameworks. Fay shows that modern techniques for high-level querying and data-parallel processing of disaggregated data streams are well suited to comprehensive monitoring of software execution in distributed systems. Revisiting a lesson from the late 1960s [Deutsch and Grant 1971], Fay also demonstrates the efficiency and extensibility benefits of using safe, statically verified machine code as the basis for low-level execution tracing. Finally, Fay establishes that, by automatically deriving optimized query plans and code for safe extensions, the expressiveness and performance of high-level tracing queries can equal or even surpass that of specialized monitoring tools.
Pages: 35
Related papers
68 in total
[1] ANSEL J., 2011, P C PROGR LANG DES I
[2] AVGUSTINOV P., 2006, P C OBJ OR PROGR SYS
[3] BALAZINSKA M, 2005, P ACM SIGMOD INT C M
[4] BARHAM P., 2004, P C OP SYST DES IMPL
[5] BERSHAD B. N., 1995, P 5 WORKSH HOT TOP O
[6] BHATIA S., 2008, P C OP SYST DES IMPL
[7] BUNGALE P. P., 2007, P 3 INT ACM SIGPLAN
[8] BURROWS M., 2000, P INT C ARCH SUPP PR
[9] Cantrill B., 2004, P USENIX ANN TECHN C
[10] Cantrill B., 2006, ACM Queue, p26