Reliable Computing Service in Massive-Scale Systems through Rapid Low-Cost Failover

Cited by: 15
Authors
Yang, Renyu [1]
Zhang, Yang [2]
Garraghan, Peter [3]
Feng, Yihui [2]
Ouyang, Jin [2]
Xu, Jie [4]
Zhang, Zhuo [2]
Li, Chao [2]
Affiliations
[1] Beihang Univ, BDBC Ctr, Beijing 100191, Peoples R China
[2] Alibaba Cloud Inc, Hangzhou, Zhejiang, Peoples R China
[3] Univ Lancaster, Sch Comp & Commun, Lancaster LA1 4WA, England
[4] Univ Leeds, Comp, Leeds, W Yorkshire, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Cloud computing; resource management; reliability; services; failover;
DOI
10.1109/TSC.2016.2544313
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Large-scale distributed systems deployed as Cloud datacenters are capable of provisioning service to consumers with diverse business requirements. Providers face pressure to provision uninterrupted reliable services while reducing operational costs due to significant software and hardware failures. A widely adopted means to achieve such a goal is using redundant system components to implement user-transparent failover, yet its effectiveness must be balanced carefully without incurring heavy overhead when deployed, an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer serious limitations, including mandatory restart leading to poor cost-effectiveness, as well as solely focusing on crash failures, omitting other important types, such as timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restart, with minimal system resource overhead. It also copes with different failures, including correlated and simultaneous events. The proposed approach was implemented, deployed and evaluated within the Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that our approach tolerates complex failure scenarios while incurring at worst 228.5 microsecond instance overhead with 1.71 percent additional CPU usage.
Pages: 969 - 983
Number of pages: 15
Related papers
50 in total
  • [31] An efficient, low-cost inconsistency detection framework for data and service sharing in an Internet-scale system
    Lu, YJ
    Jiang, H
    Feng, D
    ICEBE 2005: IEEE INTERNATIONAL CONFERENCE ON E-BUSINESS ENGINEERING, PROCEEDINGS, 2005, : 373 - 380
  • [32] Enhancing the integration of PV and coal-fired power plant for low-carbon, low-cost, and reliable power supply through various energy storage systems
    Shao, Yuhao
    Lin, Yangshu
    Yang, Chao
    Wang, Yifan
    Bao, Minglei
    Zhu, Yuankai
    Guo, Wenxuan
    Yan, Xinrong
    Zheng, Chenghang
    Gao, Xiang
    SUSTAINABLE ENERGY TECHNOLOGIES AND ASSESSMENTS, 2024, 69
  • [33] Development of measurement technologies for the low-cost, reliable, rapid, on-site determination of arsenic compounds in water: Problems, progress and prospects
    Tyson, Julian F.
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2006, 231
  • [34] RaspyLab: A Low-Cost Remote Laboratory to Learn Programming and Physical Computing Through Python and Raspberry Pi
    Alvarez Ariza, Jonathan
    Gonzalez Gil, Sergio
    IEEE REVISTA IBEROAMERICANA DE TECNOLOGIAS DEL APRENDIZAJE-IEEE RITA, 2022, 17 (02): : 140 - 149
  • [35] A rapid soundscape analysis to quantify conservation benefits of temperate agroforestry systems using low-cost technology
    Bobryk, Christopher W.
    Rega-Brodsky, Christine C.
    Bardhan, Sougata
    Farina, Almo
    He, Hong S.
    Jose, Shibu
    AGROFORESTRY SYSTEMS, 2016, 90 (06) : 997 - 1008
  • [36] A rapid soundscape analysis to quantify conservation benefits of temperate agroforestry systems using low-cost technology
    Christopher W. Bobryk
    Christine C. Rega-Brodsky
    Sougata Bardhan
    Almo Farina
    Hong S. He
    Shibu Jose
    Agroforestry Systems, 2016, 90 : 997 - 1008
  • [37] Reliable and Low-Cost Digital Transformation Technology Using Progressive Web Apps in Fog Computing Architecture for Small and Medium Industries in Indonesia
    Tahir, Zulkifli
    Ilham, Amil Ahmad
    Alimuddin, Ais Prayogi
    Suyuti, Muhammad Zulfadly A.
    Charina
    ADVANCES IN INTERNET, DATA & WEB TECHNOLOGIES (EIDWT-2022), 2022, 118 : 163 - 174
  • [38] Low-Cost human and animal health diagnostics enabled through novel hybrid microfluidic systems
    Ramzy, Kelly
    Heist, Christopher
    Bandara, Gayan
    Pengpumkiat, Sumate
    Remcho, Vincent
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2017, 253
  • [39] Collaborative Mobile Edge Computing in eV2X: A Solution for Low-Cost Driver Assistance Systems
    Arghavan Keivani
    Farzad Ghayoor
    Jules-Raymond Tapamo
    Wireless Personal Communications, 2021, 118 : 1869 - 1882
  • [40] Collaborative Mobile Edge Computing in eV2X: A Solution for Low-Cost Driver Assistance Systems
    Keivani, Arghavan
    Ghayoor, Farzad
    Tapamo, Jules-Raymond
    WIRELESS PERSONAL COMMUNICATIONS, 2021, 118 (03) : 1869 - 1882