MARS: Fault Localization in Programmable Networking Systems with Low-cost In-Band Network Telemetry

Cited by: 1
Authors
Wang, Benran [1]
Chen, Hongyang [1]
Chen, Pengfei [1]
He, Zilong [1]
Yu, Guangba [1]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Peoples R China
Source
PROCEEDINGS OF THE 52ND INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2023 | 2023
Funding
National Natural Science Foundation of China;
Keywords
P4; In-band Network Telemetry; Fault Localization; Software Defined Network;
DOI
10.1145/3605573.3605622
Chinese Library Classification (CLC) number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
Recently, the adoption of Software Defined Networking (SDN) as a network infrastructure has gained significant popularity. Although the openness and programmability of SDN ease the construction of large, complex networks, it is still challenging to diagnose faults in a complex datacenter-scale network, which is crucial to guaranteeing the rigorous service level agreements (SLAs) of upper-layer applications. Previous network diagnosis tools incur significant overhead for fine-grained telemetry and usually lack the ability to automatically diagnose fine-grained faults. Although on-demand monitoring methods have been proposed to reduce telemetry overhead, they rely on static thresholds that are hard to set effectively without expert experience. In this paper, we present MARS, a lightweight system for anomaly detection with dynamic thresholds and automatic root cause localization in programmable networking systems. MARS collects aggregated packet-level telemetry on demand and generates a ranked list of fine-grained fault culprits at multiple levels, including port-level, switch-level, and flow-level. Experimental evaluations show the cost-effectiveness of MARS in terms of both network bandwidth and switch memory usage. Moreover, MARS achieves a 0.97 F1 score in anomaly detection, and 0.95 Recall at Top-2 and an overall 0.3 Exam Score in root cause localization.
Pages: 347-357
Page count: 11