Prism: Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems

Cited by: 2
Authors
Liu, Jinyang [1]
Jiang, Zhihan [1]
Gu, Jiazhen [1]
Huang, Junjie [1]
Chen, Zhuangbin [2]
Feng, Cong [3]
Yang, Zengyin [3]
Yang, Yongqiang [3]
Lyu, Michael R. [1]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Sun Yat Sen Univ, Sch Software Engn, Zhuhai, Peoples R China
[3] Huawei Cloud Comp Technol Co Ltd, Comp & Networking Innovat Lab, Huawei, Peoples R China
Source
2023 38TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE | 2023
Keywords
functional clusters; cloud observability; instances; cloud systems; software reliability;
DOI
10.1109/ASE56229.2023.00077
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Ensuring the reliability of cloud systems is critical for both cloud vendors and customers. Cloud systems often rely on virtualization techniques to create instances of hardware resources, such as virtual machines. However, virtualization hinders the observability of cloud systems, making it challenging to diagnose platform-level issues. To improve system observability, we propose to infer functional clusters of instances, i.e., groups of instances having similar functionalities. We first conduct a pilot study on a large-scale cloud system, i.e., Huawei Cloud, demonstrating that instances having similar functionalities share similar communication and resource usage patterns. Motivated by these findings, we formulate the identification of functional clusters as a clustering problem and propose a non-intrusive solution called Prism. Prism adopts a coarse-to-fine clustering strategy. It first partitions instances into coarse-grained chunks based on communication patterns. Within each chunk, Prism further groups instances with similar resource usage patterns to produce fine-grained functional clusters. Such a design reduces noise in the data and allows Prism to process massive numbers of instances efficiently. We evaluate Prism on two datasets collected from the real-world production environment of Huawei Cloud. Our experiments show that Prism achieves a v-measure of approximately 0.95, surpassing existing state-of-the-art solutions. Additionally, we illustrate the integration of Prism within monitoring systems for enhanced cloud reliability through two real-world use cases.
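The coarse-to-fine strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measures or clustering algorithms, so the sketch assumes Jaccard similarity over each instance's set of communication peers for the coarse step and Pearson correlation over a resource usage series for the fine step. All function names, thresholds, and the toy data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def coarse_chunks(peers, threshold=0.5):
    """Coarse step (assumed): connected components of instances with similar peer sets."""
    parent = {name: name for name in peers}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(peers, 2):
        if jaccard(peers[a], peers[b]) >= threshold:
            parent[find(a)] = find(b)

    chunks = {}
    for name in peers:
        chunks.setdefault(find(name), []).append(name)
    return list(chunks.values())


def fine_clusters(chunk, usage, threshold=0.8):
    """Fine step (assumed): greedy grouping by correlation with each cluster's first member."""
    clusters = []
    for name in chunk:
        for cluster in clusters:
            if pearson(usage[name], usage[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def coarse_to_fine(peers, usage):
    """Chunk by communication pattern, then split each chunk by resource usage."""
    result = []
    for chunk in coarse_chunks(peers):
        result.extend(fine_clusters(chunk, usage))
    return result


# Hypothetical toy data: peer sets and CPU usage series per instance.
peers = {
    "web-1": {"lb", "db", "cache"},
    "web-2": {"lb", "db", "cache"},
    "batch-1": {"db", "queue", "store"},
    "batch-2": {"db", "queue", "store"},
    "batch-3": {"db", "queue", "store"},
}
usage = {
    "web-1": [10, 40, 80, 40, 10],
    "web-2": [12, 38, 85, 42, 9],
    "batch-1": [90, 10, 90, 10, 90],
    "batch-2": [88, 12, 91, 9, 93],
    "batch-3": [20, 30, 40, 50, 60],
}
clusters = coarse_to_fine(peers, usage)
print(clusters)
```

On this toy data the two web instances fall into one cluster, batch-1 and batch-2 into another, and batch-3 is separated from them because its usage series does not correlate with theirs even though it shares their communication peers. The pairwise comparison in the coarse step is quadratic in the number of instances, so a system handling massive numbers of instances would need a cheaper candidate-selection scheme; the sketch only shows the two-stage structure.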
Pages: 268-280
Number of pages: 13