Horus: Non-Intrusive Causal Analysis of Distributed Systems Logs

Cited by: 1
Authors
Neves, Francisco [1, 2]
Machado, Nuno [1, 3]
Vilaca, Ricardo [1, 2]
Pereira, Jose [1, 2]
Affiliations
[1] INESC TEC, Braga, Portugal
[2] U Minho, Braga, Portugal
[3] Amazon, Madrid, Spain
Source
51ST ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN 2021) | 2021
DOI
10.1109/DSN48987.2021.00035
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Logs remain the primary resource for debugging distributed system executions. The complexity and heterogeneity of modern distributed systems, however, make log analysis extremely challenging, for two reasons. First, the sheer volume of messages means that the execution paths of distinct system components appear interleaved. Second, because physical clocks are unsynchronized, simply ordering the log messages by timestamp does not suffice to obtain a causal trace of the execution. To address these issues, we present Horus, a system that enables the refinement of distributed system logs in a causally-consistent and scalable fashion. Horus leverages kernel-level probing to capture events for tracking causality between application-level logs from multiple sources. The events are then encoded as a directed acyclic graph and stored in a graph database, thus allowing the use of rich query languages to reason about runtime behavior. Our case study with TrainTicket, a ticket booking application with 40+ microservices, shows that Horus surpasses current widely-adopted log analysis systems in pinpointing the root cause of anomalies in distributed executions. We also show that Horus builds a causally-consistent log of a distributed execution with much higher performance (up to 3 orders of magnitude) and scalability than prior state-of-the-art solutions. Finally, we show that Horus' approach to querying causality is up to 30 times faster than graph database built-in traversal algorithms.
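The second challenge named in the abstract, that timestamp order is not causal order, can be illustrated with a small sketch. This is not Horus' implementation: the event tuples, service names, and the `happens_before_edges` and `causal_order` helpers are hypothetical, and the graph is kept in memory rather than in a graph database. It assumes only that events within one process are ordered by their local clock and that each receive can be matched to its send.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical events: (event_id, process, local_timestamp, kind, message_id).
# svc-b's clock runs behind svc-a's, so its receive appears "earlier" than the send.
EVENTS = [
    ("a1", "svc-a", 100, "log", None),
    ("a2", "svc-a", 105, "send", "m1"),
    ("b1", "svc-b", 90, "recv", "m1"),
    ("b2", "svc-b", 95, "log", None),
]

def happens_before_edges(events):
    """Return (src, dst) pairs: program order per process plus send -> recv."""
    edges = set()
    by_process = {}
    for ev in events:
        by_process.setdefault(ev[1], []).append(ev)
    # Within one process the local clock is assumed to give a valid order.
    for evs in by_process.values():
        evs.sort(key=lambda e: e[2])
        for prev, nxt in zip(evs, evs[1:]):
            edges.add((prev[0], nxt[0]))
    # Across processes, a send happens before its matching receive.
    sends = {e[4]: e[0] for e in events if e[3] == "send"}
    for e in events:
        if e[3] == "recv" and e[4] in sends:
            edges.add((sends[e[4]], e[0]))
    return edges

def causal_order(events):
    """Topologically sort the happens-before DAG into a causally consistent log."""
    ts = TopologicalSorter()
    for ev in events:
        ts.add(ev[0])
    for src, dst in happens_before_edges(events):
        ts.add(dst, src)  # src must precede dst
    return list(ts.static_order())

timestamp_order = [e[0] for e in sorted(EVENTS, key=lambda e: e[2])]
order = causal_order(EVENTS)
print("by timestamp:", timestamp_order)
print("causal:      ", order)
```

Sorting by timestamp places the receive `b1` before the send `a2`, whereas the topological order of the happens-before graph keeps every send ahead of its receive.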
Pages: 212-223
Page count: 12
Related papers
50 in total
  • [1] An efficient non-intrusive checkpointing algorithm for distributed database systems
    Wu, Jiang
    Manivarman, D.
    DISTRIBUTED COMPUTING AND NETWORKING, PROCEEDINGS, 2006, 4308 : 82 - 87
  • [2] Non-intrusive techniques for vulnerability assessment of services in distributed systems
    Genge, Bela
    Graur, Flavius
    Enachescu, Calin
    8TH INTERNATIONAL CONFERENCE INTERDISCIPLINARITY IN ENGINEERING, INTER-ENG 2014, 2015, 19 : 12 - 19
  • [3] Non-intrusive transaction monitoring using system logs
    Sengupta, Bikram
    Banerjee, Nilanjan
    Anandkumar, Animashree
    Bisdikian, Chatschik
    2008 IEEE NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM, VOLS 1 AND 2, 2008, : 879 - +
  • [4] Non-intrusive minimum process synchronous checkpointing protocol for mobile distributed systems
    Kumar, P
    Kumar, L
    Chauhan, RK
    Gupta, VK
    2005 IEEE INTERNATIONAL CONFERENCE ON PERSONAL WIRELESS COMMUNICATIONS, 2005, : 491 - 495
  • [5] Intrusive and non-intrusive watermarking
    Hari, KVJ
    Ramakrishnan, KR
    2002 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOL III, PROCEEDINGS, 2002, : 637 - 640
  • [6] High performance non-intrusive distributed CORBA monitoring
    Vermeulen, B
    De Reu, D
    Dhoedt, B
    Demeester, P
    2002 IEEE WORKSHOP ON IP OPERATIONS AND MANAGEMENT, 2002, : 116 - 120
  • [7] Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach
    Zhang, Yongle
    Makarov, Serguei
    Ren, Xiang
    Lion, David
    Yuan, Ding
    PROCEEDINGS OF THE TWENTY-SIXTH ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES (SOSP '17), 2017, : 19 - 33
  • [8] Non-Intrusive Protection for Legacy SCADA Systems
    Chan, Aldar C. -F.
    Zhou, Jianying
    IEEE COMMUNICATIONS MAGAZINE, 2023, 61 (06) : 36 - 42
  • [9] Non-intrusive Condition Monitoring for Manufacturing Systems
    Suzuki, Ryota
    Kohmoto, Shigeru
    Ogatsu, Toshinobu
    2017 25TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2017, : 1390 - 1394
  • [10] Non-intrusive BIST for systems-on-a-chip
    Chiusano, S
    Prinetto, P
    Wunderlich, HJ
    INTERNATIONAL TEST CONFERENCE 2000, PROCEEDINGS, 2000, : 644 - 651