Reliable Computing Service in Massive-Scale Systems through Rapid Low-Cost Failover

Cited by: 15
Authors
Yang, Renyu [1]
Zhang, Yang [2]
Garraghan, Peter [3]
Feng, Yihui [2]
Ouyang, Jin [2]
Xu, Jie [4]
Zhang, Zhuo [2]
Li, Chao [2]
Affiliations
[1] Beihang Univ, BDBC Ctr, Beijing 100191, Peoples R China
[2] Alibaba Cloud Inc, Hangzhou, Zhejiang, Peoples R China
[3] Univ Lancaster, Sch Comp & Commun, Lancaster LA1 4WA, England
[4] Univ Leeds, Comp, Leeds, W Yorkshire, England
Funding
UK Engineering and Physical Sciences Research Council;
Keywords
Cloud computing; resource management; reliability; services; failover;
DOI
10.1109/TSC.2016.2544313
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Large-scale distributed systems deployed as Cloud datacenters are capable of provisioning service to consumers with diverse business requirements. Providers face pressure to provision uninterrupted reliable services while reducing operational costs due to significant software and hardware failures. A widely adopted means to achieve such a goal is using redundant system components to implement user-transparent failover, yet its effectiveness must be balanced carefully without incurring heavy overhead when deployed, an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer serious limitations, including mandatory restart leading to poor cost-effectiveness, as well as solely focusing on crash failures, omitting other important types, such as timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restart, with minimal system resource overhead. It also copes with different failures, including correlated and simultaneous events. The proposed approach was implemented, deployed and evaluated within the Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that our approach tolerates complex failure scenarios while incurring at worst 228.5 microsecond instance overhead with 1.71 percent additional CPU usage.
Pages: 969-983
Number of pages: 15