Proactive Telemetry in Large-Scale Multi-Tenant Cloud Overlay Networks

Cited by: 0
Authors
Zhu, Shunmin [1, 2]
Lu, Jianyuan [2]
Lyu, Biao [2, 3]
Pan, Tian [2]
Zhang, Shize [2]
Sun, Xiaoqing [2]
Jia, Chenhao [2]
Cheng, Xin [2]
Kang, Daxiang [2]
Lv, Yilong [2]
Yang, Fukun [2]
Xue, Xiaobo [2]
Yang, Xihui [2]
Wang, Zhiliang [1, 4, 5]
Yang, Jiahai [1, 4, 5]
Affiliations
[1] Tsinghua Univ, Inst Network Sci & Cyberspace, BNRist, Beijing 100084, Peoples R China
[2] Alibaba Grp, Hangzhou 310052, Peoples R China
[3] Zhejiang Univ, Coll Control Sci & Engn, Hangzhou 310027, Peoples R China
[4] Zhongguancun Lab, Beijing 100094, Peoples R China
[5] Quan Cheng Lab, Jinan 250103, Peoples R China
Keywords
Telemetry; Cloud computing; Middleboxes; Topology; Probes; Network topology; Logic gates; Public cloud; virtual network; proactive telemetry system
DOI
10.1109/TNET.2024.3381786
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
At present, public clouds serve millions of tenants. To provide reliable services, cloud vendors need to perceive the health status of the cloud network by building a telemetry system that detects possible network failures. While telemetry systems for physical networks have been extensively studied, research on telemetry systems for virtual networks is still insufficient. We conclude that, unlike in physical networks, building a virtual network telemetry system faces new challenges of feasibility, efficiency, and effectiveness. Specifically, we need to 1) protect the privacy of tenants and adapt to heterogeneous middleboxes at the data plane; 2) handle frequent virtual network topology updates and compress large-scale measurement paths for millions of tenants at the control plane; 3) analyze telemetry results to locate network failures at the analysis plane. To address these challenges, we present Zoonet, a proactive virtual network telemetry system for multi-tenant clouds. At the data plane, Zoonet uses a host agent and arp-ping to protect tenants' privacy and defines an elegant generalization of ping and traceroute that works on heterogeneous middleboxes. At the control plane, Zoonet conducts update batch processing and substantial probing path pruning to lessen the overhead. At the analysis plane, Zoonet reduces noise and aggregates alerts based on temporal and spatial correlation, and uses a hop-by-hop telemetry mode to locate failures. Zoonet has been deployed in Alibaba Cloud for over two years, covering tens of cloud regions and hundreds of thousands of servers. We have become increasingly reliant on Zoonet, as it reduces the personnel engaged in troubleshooting by 86%.
Pages: 3002-3017
Page count: 16
Related papers
39 in total
  • [1] Adams A., 2016, NETNORADTROUBLESHOOT
  • [2] Agmon Ben-Yehuda O., 2011, Proceedings of the 2011 IEEE 3rd International Conference on Cloud Computing Technology and Science (CloudCom 2011), P304, DOI 10.1109/CloudCom.2011.48
  • [3] Amazon, 2021, ELASTIC NETWORK INTE
  • [4] Arzani B, 2018, PROCEEDINGS OF THE 15TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION (NSDI'18), P419
  • [5] Reachability Analysis for AWS-Based Networks
    Backes, John
    Bayless, Sam
    Cook, Byron
    Dodge, Catherine
    Gacek, Andrew
    Hu, Alan J.
    Kahsai, Temesghen
    Kocik, Bill
    Kotelnikov, Evgenii
    Kukovec, Jure
    McLaughlin, Sean
    Reed, Jason
    Rungta, Neha
    Sizemore, John
    Stalzer, Mark
    Srinivasan, Preethi
    Subotic, Pavle
    Varming, Carsten
    Whaley, Blake
    [J]. COMPUTER AIDED VERIFICATION, CAV 2019, PT II, 2019, 11562 : 231 - 241
  • [6] PINT: Probabilistic In-band Network Telemetry
    Ben Basat, Ran
    Ramanathan, Sivaramakrishnan
    Li, Yuliang
    Antichi, Gianni
    Yu, Minlan
    Mitzenmacher, Michael
    [J]. SIGCOMM '20: PROCEEDINGS OF THE 2020 ANNUAL CONFERENCE OF THE ACM SPECIAL INTEREST GROUP ON DATA COMMUNICATION ON THE APPLICATIONS, TECHNOLOGIES, ARCHITECTURES, AND PROTOCOLS FOR COMPUTER COMMUNICATION, 2020, : 662 - 680
  • [7] Eisenbud DE, 2016, 13TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION (NSDI '16), P523
  • [8] VTrace: Automatic Diagnostic System for Persistent Packet Loss in Cloud-Scale Overlay Network
    Fang, Chongrong
    Liu, Haoyu
    Miao, Mao
    Ye, Jie
    Wang, Lei
    Zhang, Wansheng
    Kang, Daxiang
    Lyv, Biao
    Cheng, Peng
    Chen, Jiming
    [J]. SIGCOMM '20: PROCEEDINGS OF THE 2020 ANNUAL CONFERENCE OF THE ACM SPECIAL INTEREST GROUP ON DATA COMMUNICATION ON THE APPLICATIONS, TECHNOLOGIES, ARCHITECTURES, AND PROTOCOLS FOR COMPUTER COMMUNICATION, 2020, : 31 - 43
  • [9] Geng YL, 2019, PROCEEDINGS OF THE 16TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION, P549
  • [10] Gross Jesse, 2020, RFC 8926