Reliable Computing Service in Massive-Scale Systems through Rapid Low-Cost Failover

Cited by: 15
Authors
Yang, Renyu [1]
Zhang, Yang [2]
Garraghan, Peter [3]
Feng, Yihui [2]
Ouyang, Jin [2]
Xu, Jie [4]
Zhang, Zhuo [2]
Li, Chao [2]
Affiliations
[1] Beihang Univ, BDBC Ctr, Beijing 100191, Peoples R China
[2] Alibaba Cloud Inc, Hangzhou, Zhejiang, Peoples R China
[3] Univ Lancaster, Sch Comp & Commun, Lancaster LA1 4WA, England
[4] Univ Leeds, Comp, Leeds, W Yorkshire, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
Cloud computing; resource management; reliability; services; failover
DOI
10.1109/TSC.2016.2544313
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale distributed systems deployed as Cloud datacenters are capable of provisioning services to consumers with diverse business requirements. Because of significant software and hardware failures, providers face pressure to provision uninterrupted, reliable services while reducing operational costs. A widely adopted means of achieving this goal is to use redundant system components to implement user-transparent failover, yet its effectiveness must be balanced carefully against the overhead it incurs when deployed, an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer from serious limitations, including mandatory restart, which leads to poor cost-effectiveness, and a sole focus on crash failures that omits other important types, such as timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restart, with minimal system resource overhead. It also copes with different failures, including correlated and simultaneous events. The proposed approach was implemented, deployed and evaluated within the Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that the approach tolerates complex failure scenarios while incurring at worst 228.5 microsecond instance overhead with 1.71 percent additional CPU usage.
Pages: 969-983
Number of pages: 15