Cloud2HDD: Large-Scale HDD Data Analysis on Cloud for Cloud Datacenters

Cited by: 0
Authors
Zeydan, Engin [1]
Arslan, Suayb S. [2]
Affiliations
[1] Ctr Technol Telecomunicac Catalunya, Barcelona 08860, Spain
[2] MEF Univ, Dept Comp Engn, TR-34912 Istanbul, Turkey
Source
2020 23RD CONFERENCE ON INNOVATION IN CLOUDS, INTERNET AND NETWORKS AND WORKSHOPS (ICIN 2020) | 2020
Keywords
Cloud; Data Center; Hadoop; HDDs; lifetime; machine learning
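DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract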
This paper develops a distributed, large-scale data analysis platform for the open-source data of the Backblaze cloud datacenter, which consists of operational hard disk drive (HDD) information collected over an observation period of 2272 days (over 74 months). To carefully analyze the intrinsic characteristics of hard disk behavior, we exploit this large volume of data together with the Hadoop ecosystem as our big data processing engine. In other words, we use a distributed scheme on the cloud for cloud HDD data, which we term Cloud2HDD. To classify the remaining lifetime of hard disk drives based on health indicators such as the built-in S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) features, we apply several state-of-the-art classification algorithms and compare their accuracy, precision, and recall. In addition, we identify the importance of various S.M.A.R.T. features in predicting the true remaining lifetime of HDDs. For instance, our results indicate that, compared with the other classification approaches and using a specific subset of S.M.A.R.T. features, the Random Forest Classifier (RFC) can yield up to 94% accuracy with the highest precision and recall in a reasonable time when classifying the remaining lifetime of drives into one of three classes, namely critical, high, and low ideal states.
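The comparison of accuracy, precision, and recall across three lifetime classes can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `evaluate`, the short class labels, and the toy label lists are assumptions made only for illustration, and no Hadoop or S.M.A.R.T. processing is shown.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return overall accuracy and per-class (precision, recall) pairs."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    true_pos = Counter(t for t, p in pairs if t == p)  # correct predictions per class
    predicted = Counter(y_pred)                        # how often each class was predicted
    actual = Counter(y_true)                           # how often each class truly occurs

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        report[label] = (precision, recall)
    return accuracy, report


# Toy labels, invented for illustration only (not results from the paper).
y_true = ["critical", "high", "low", "low", "high", "critical"]
y_pred = ["critical", "high", "low", "high", "high", "low"]
accuracy, report = evaluate(y_true, y_pred)
```

Reporting precision and recall per class, rather than accuracy alone, matters when the classes are imbalanced, which is plausible here since drives close to failure are usually a small share of a fleet.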
Pages: 243-249
Page count: 7