Cloud2HDD: Large-Scale HDD Data Analysis on Cloud for Cloud Datacenters

Cited by: 0

Authors:

Zeydan, Engin [1]

Arslan, Suayb S. [2]

Affiliations:

[1] Ctr Technol Telecomunicac Catalunya, Barcelona 08860, Spain

[2] MEF Univ, Dept Comp Engn, TR-34912 Istanbul, Turkey

Source:

2020 23RD CONFERENCE ON INNOVATION IN CLOUDS, INTERNET AND NETWORKS AND WORKSHOPS (ICIN 2020) | 2020

Keywords:

Cloud; Data Center; Hadoop; HDDs; lifetime; machine learning;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

The main focus of this paper is to develop a distributed, large-scale data analysis platform for the open-source data of the Backblaze cloud datacenter, which consists of operational hard disk drive (HDD) information collected over an observation period of 2272 days (over 74 months). To carefully analyze the intrinsic characteristics of hard disk behavior, we have exploited a large volume of data and the benefits of the Hadoop ecosystem as our big data processing engine. In other words, we have utilized a special distributed scheme on the cloud for cloud HDD data, which we term Cloud2HDD. To classify the remaining lifetime of hard disk drives based on health indicators such as built-in S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) features, we used several state-of-the-art classification algorithms and compared their accuracy, precision, and recall rates simultaneously. In addition, the importance of various S.M.A.R.T. features in predicting the true remaining lifetime of HDDs is identified. For instance, our analysis results indicate that, based on a specific subset of S.M.A.R.T. features, the Random Forest Classifier (RFC) can yield up to 94% accuracy with the highest precision and recall among the compared classification approaches, at a reasonable time, by classifying the remaining lifetime of drives into one of three different classes, namely critical, high and low ideal states.

Citation

Pages: 243-249

Number of pages: 7

19 references in total

[1] Apache Spark, 2019, AP SPARK UN AN ENG B
[2] Arfat Y, 2020, EAI SPRINGER INNOVAT, P453, DOI 10.1007/978-3-030-13705-2_19
[3] A data-assisted reliability model for carrier-assisted cold data storage systems
Arslan, Suayb S.
Peng, James
Goker, Turguy
[J]. RELIABILITY ENGINEERING & SYSTEM SAFETY, 2020, 196 (196)
[4] A Reliability Model for Dependent and Distributed MDS Disk Array Units
Arslan, Suayb S.
[J]. IEEE TRANSACTIONS ON RELIABILITY, 2019, 68 (01) : 133 - 148
[5] Predictive Models of Hard Drive Failures based on Operational Data
Aussel, Nicolas
Jaulin, Samuel
Gandon, Guillaume
Petetin, Yohan
Fazli, Eriza
Chabridon, Sophie
[J]. 2017 16TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2017, : 619 - 625
[6] Backblaze, 2018, HARD DRIV DAT STATS
[7] Predicting Disk Replacement towards Reliable Data Centers
Botezatu, Mirela
Giurgiu, Ioana
Bogojeska, Jasmina
Wiesmann, Dorothea
[J]. KDD'16: PROCEEDINGS OF THE 22ND ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2016, : 39 - 48
[8] Hard Drive Failure Prediction Using Classification and Regression Trees
Li, Jing
Ji, Xinpu
Jia, Yuhan
Zhu, Bingpeng
Wang, Gang
Li, Zhongwei
Liu, Xiaoguang
[J]. 2014 44TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN), 2014, : 383 - 394
[9] RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures
Ma, Ao
Traylor, Rachel
Douglis, Fred
Chamness, Mark
Lu, Guanlin
Sawyer, Darren
Chandra, Surendar
Hsu, Windsor
[J]. ACM TRANSACTIONS ON STORAGE, 2015, 11 (04)
[10] Towards Self-Managing Cloud Storage with Reinforcement Learning
Noel, Ridwan Rashid
Mehra, Rohit
Lama, Palden
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING (IC2E), 2019, : 34 - 44
