Prism: Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems

Cited by: 2
Authors
Liu, Jinyang [1]
Jiang, Zhihan [1]
Gu, Jiazhen [1]
Huang, Junjie [1]
Chen, Zhuangbin [2]
Feng, Cong [3]
Yang, Zengyin [3]
Yang, Yongqiang [3]
Lyu, Michael R. [1]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Sun Yat Sen Univ, Sch Software Engn, Zhuhai, Peoples R China
[3] Huawei Cloud Comp Technol Co Ltd, Comp & Networking Innovat Lab, Huawei, Peoples R China
Source
2023 38TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE | 2023
Keywords
functional clusters; cloud observability; instances; cloud systems; software reliability
DOI
10.1109/ASE56229.2023.00077
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Ensuring the reliability of cloud systems is critical for both cloud vendors and customers. Cloud systems often rely on virtualization techniques to create instances of hardware resources, such as virtual machines. However, virtualization hinders the observability of cloud systems, making it challenging to diagnose platform-level issues. To improve system observability, we propose to infer functional clusters of instances, i.e., groups of instances having similar functionalities. We first conduct a pilot study on a large-scale cloud system, i.e., Huawei Cloud, demonstrating that instances having similar functionalities share similar communication and resource usage patterns. Motivated by these findings, we formulate the identification of functional clusters as a clustering problem and propose a non-intrusive solution called Prism. Prism adopts a coarse-to-fine clustering strategy. It first partitions instances into coarse-grained chunks based on communication patterns. Within each chunk, Prism further groups instances with similar resource usage patterns to produce fine-grained functional clusters. Such a design reduces noise in the data and allows Prism to process massive numbers of instances efficiently. We evaluate Prism on two datasets collected from the real-world production environment of Huawei Cloud. Our experiments show that Prism achieves a v-measure of approximately 0.95, surpassing existing state-of-the-art solutions. Additionally, we illustrate the integration of Prism within monitoring systems for enhanced cloud reliability through two real-world use cases.
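The coarse-to-fine strategy described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how chunks or clusters are computed, so the sketch assumes (hypothetically) that coarse chunks are the connected components of a communication graph and that fine clusters group instances whose resource usage series have a Pearson correlation above a threshold. All function names, the threshold value, and the toy data are illustrative assumptions.

```python
from collections import defaultdict
from math import sqrt


def coarse_chunks(instances, edges):
    """Coarse step (assumption): chunks are connected components of the
    communication graph, found with a simple union-find."""
    parent = {i: i for i in instances}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    chunks = defaultdict(list)
    for i in instances:
        chunks[find(i)].append(i)
    return list(chunks.values())


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def fine_clusters(chunk, usage, threshold=0.9):
    """Fine step (assumption): greedily attach each instance to the first
    cluster whose first member has a sufficiently correlated usage series."""
    clusters = []
    for inst in chunk:
        for cluster in clusters:
            if pearson(usage[inst], usage[cluster[0]]) >= threshold:
                cluster.append(inst)
                break
        else:
            clusters.append([inst])
    return clusters


def coarse_to_fine(instances, edges, usage, threshold=0.9):
    """Run the fine step independently inside every coarse chunk."""
    result = []
    for chunk in coarse_chunks(instances, edges):
        result.extend(fine_clusters(chunk, usage, threshold))
    return result


# Toy data (hypothetical): two web servers, two databases, one isolated batch job.
instances = ["web1", "web2", "db1", "db2", "batch1"]
edges = [("web1", "web2"), ("web1", "db1"), ("web2", "db2")]
usage = {
    "web1": [1, 5, 2, 6],
    "web2": [2, 10, 4, 12],
    "db1": [6, 2, 5, 1],
    "db2": [12, 4, 10, 2],
    "batch1": [3, 3, 3, 3],
}
clusters = coarse_to_fine(instances, edges, usage)
print(clusters)
```

In this toy example the web and database instances communicate with each other and therefore fall into one coarse chunk, and the fine step then separates them by usage pattern. Because similarity is only computed within a chunk, the number of pairwise comparisons stays far below what an all-pairs comparison over every instance would need, which is consistent with the efficiency argument in the abstract.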
Pages: 268-280
Number of pages: 13